From 7b98b8b9b6507a1ef23541a6f896f7c7c3ebd6cd Mon Sep 17 00:00:00 2001
From: suraj-vathsa <85908731+suraj-vathsa@users.noreply.github.com>
Date: Fri, 15 Dec 2023 11:40:32 -0800
Subject: [PATCH] Suraj/update triton main (#1)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Changed copyright (#5705)
* Modify timeout test in L0_sequence_batcher to use portable backend (#5696)
* Modify timeout test in L0_sequence_batcher to use portable backend
* Use identity backend that is built by default on Windows
* updated upstream container name (#5713)
* Fix triton container version (#5714)
* Update the L0_model_config test expected error message (#5684)
* Use better value in timeout test L0_sequence_batcher (#5716)
* Use better value in timeout test L0_sequence_batcher
* Format
* Update JAX install (#5613)
* Add notes about socket usage to L0_client_memory_growth test (#5710)
* Check TensorRT error message more granularly (#5719)
* Check TRT err msg more granularly
* Clarify source of error messages
* Consolidate tests for message parts
* Pin Python Package Versions for HTML Document Generation (#5727)
* updating with pinned versions for python dependencies
* updated with pinned sphinx and nbclient versions
* Test full error returned when custom batcher init fails (#5729)
* Add testing for batcher init failure, add wait for status check
* Formatting
* Change search string
* Add fastertransformer test (#5500)
Add fastertransformer test that uses 1GPU.
* Fix L0_backend_python on Jetson (#5728)
* Don't use mem probe in Jetson
* Clarify failure messages in L0_backend_python
* Update copyright
* Add JIRA ref, fix _test_jetson
* Add testing for Python custom metrics API (#5669)
* Add testing for python custom metrics API
* Add custom metrics example to the test
* Fix for CodeQL report
* Fix test name
* Address comment
* Add logger and change the enum usage
* Add testing for Triton Client Plugin API (#5706)
* Add HTTP client plugin test
* Add testing for HTTP asyncio
* Add async plugin support
* Fix qa container for L0_grpc
* Add testing for grpc client plugin
* Remove unused imports
* Fix up
* Fix L0_grpc models QA folder
* Update the test based on review feedback
* Remove unused import
* Add testing for .plugin method
* Install jemalloc (#5738)
* Add --metrics-address and testing (#5737)
* Add --metrics-address, add tests to L0_socket, re-order CLI options for consistency
* Use non-localhost address
* Add testing for basic auth plugin for HTTP/gRPC clients (#5739)
* Add HTTP basic auth test
* Add testing for gRPC basic auth
* Fix up
* Remove unused imports
* Add multi-gpu, multi-stream testing for dlpack tensors (#5550)
* Add multi-gpu, multi-stream testing for dlpack tensors
* Update note on SageMaker MME support for ensemble (#5723)
* Run L0_backend_python subtests with virtual environment (#5753)
* Update 'main' to track development of 2.35.0 / r23.06 (#5764)
* Include jemalloc into the documentation (#5760)
* Enhance tests in L0_model_update (#5724)
* Add model instance name update test
* Add gap for timestamp to update
* Add some tests with dynamic batching
* Extend supported test on rate limit off
* Continue test if off mode failed
* Fix L0_memory_growth (#5795) (1) reduce MAX_ALLOWED_ALLOC to be more strict for bounded tests, and generous for unbounded tests. (2) allow unstable measurement from PA.
(3) improve logging for future triage * Add note on --metrics-address (#5800) * Add note on --metrics-address * Copyright * Minor fix for running "mlflow deployments create -t triton --flavor triton ..." (#5658) UnboundLocalError: local variable 'meta_dict' referenced before assignment The above error shows in listing models in Triton model repository * Adding test for new sequence mode (#5771) * Adding test for new sequence mode * Update option name * Clean up testing spacing and new lines * MLFlow Triton Plugin: Add support for s3 prefix and custom endpoint URL (#5686) * MLFlow Triton Plugin: Add support for s3 prefix and custom endpoint URL Signed-off-by: Xiaodong Ye * Update the function order of config.py and use os.path.join to replace filtering a list of strings then joining Signed-off-by: Xiaodong Ye * Update onnx flavor to support s3 prefix and custom endpoint URL Signed-off-by: Xiaodong Ye * Fix two typos in MLFlow Triton plugin README.md Signed-off-by: Xiaodong Ye * Address review comments (replace => strip) Signed-off-by: Xiaodong Ye * Address review comments (init regex only for s3) Signed-off-by: Xiaodong Ye * Remove unused local variable: slash_locations Signed-off-by: Xiaodong Ye --------- Signed-off-by: Xiaodong Ye * Fix client script (#5806) * Add MLFlow test for already loaded models. Update copyright year (#5808) * Use the correct gtest filter (#5824) * Add error message test on S3 access decline (#5825) * Add test on access decline * Fix typo * Add MinIO S3 access decline test * Make sure bucket exists during access decline test * Restore AWS_SECRET_ACCESS_KEY on S3 local test (#5832) * Restore AWS_SECRET_ACCESS_KEY * Add reason for restoring keys * nnshah1 stream infer segfault fix (#5842) match logic from infer_handler.cc * Remove unused test (#5851) * Add and document memory usage in statistic protocol (#5642) * Add and document memory usage in statistic protocol * Fix doc * Fix up * [DO NOT MERGE Add test. 
FIXME: model generation * Fix up * Fix style * Address comment * Fix up * Set memory tracker backend option in build.py * Fix up * Add CUPTI library in Windows image build * Add note to build with memory tracker by default * use correct lib dir on CentOS (#5836) * use correct lib dir on CentOS * use new location for opentelemetry-cpp * Document that gpu-base flag is optional for cpu-only builds (#5861) * Update Jetson tests in Docker container (#5734) * Add flags for ORT build * Separate list with commas * Remove unnecessary detection of nvcc compiler * Fixed Jetson path for perf_client, datadir * Create version directoryy for custom model * Remove probe check for shm, add shm exceed error for Jetson * Copyright updates, fix Jetson Probe * Fix be_python test num on Jetson * Remove extra comma, non-Dockerized Jetson comment * Remove comment about Jetson being non-dockerized * Remove no longer needed flag * Update `main` post-23.05 release (#5880) * Update README and versions for 23.05 branch * Changes to support 23.05 (#5782) * Update python and conda version * Update CMAKE installation * Update checksum version * Update ubuntu base image to 22.04 * Use ORT 1.15.0 * Set CMAKE to pull latest version * Update libre package version * Removing unused argument * Adding condition for ubuntu 22.04 * Removing installation of the package from the devel container * Nnshah1 u22.04 (#5770) * Update CMAKE installation * Update python and conda version * Update CMAKE installation * Update checksum version * Update ubuntu base image to 22.04 * updating versions for ubuntu 22.04 * remove re2 --------- Co-authored-by: Neelay Shah Co-authored-by: Neelay Shah * Set ONNX version to 1.13.0 * Fix L0_custom_ops for ubuntu 22.04 (#5775) * add back rapidjson-dev --------- Co-authored-by: Neelay Shah Co-authored-by: Neelay Shah Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com> * Fix L0_mlflow (#5805) * working thread * remove default install of blinker * merge issue fixed * Fix L0_backend_python/env test (#5799) * Fix L0_backend_python/env test * Address comment * Update the copyright * Fix up * Fix L0_http_fuzz (#5776) * installing python 3.8.16 for test * spelling Co-authored-by: Neelay Shah * use util functions to install python3.8 in an easier way --------- Co-authored-by: Neelay Shah * Update Windows versions for 23.05 release (#5826) * Rename Ubuntu 20.04 mentions to 22.04 (#5849) * Update DCGM version (#5856) * Update DCGM version (#5857) * downgrade DCGM version to 2.4.7 (#5860) * Updating link for latest release notes to 23.05 --------- Co-authored-by: Neelay Shah Co-authored-by: Neelay Shah Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com> Co-authored-by: Iman Tabrizian * Disable memory tracker on Jetpack until the library is available (#5882) * Fix datadir for x86 (#5894) * Add more test on instance signature (#5852) * Add testing for new error handling API (#5892) * Test batch input for libtorch (#5855) * Draft ragged TensorRT unit model gen * Draft libtorch special identity model * Autoformat * Update test, fix ragged model gen * Update suffix for io for libtorch * Remove unused variables * Fix io names for libtorch * Use INPUT0/OUTPUT0 for libtorch * Reorder to match test model configs * Remove unnecessary capitalization * Auto-format * Capitalization is necessary * Remove unnecessary export * Clean up Azure dependency in server build (#5900) * [DO NOT MERGE] * Remove Azure dependency in server component build * Finalize * Fix dependency * 
Fixing up * Clean up * Add response parameters for streaming GRPC inference to enhance decoupled support (#5878) * Update 'main' to track development of 2.36.0 / 23.07 (#5917) * Add test for detecting S3 http2 upgrade request (#5911) * Add test for detecting S3 http2 upgrade request * Enhance testing * Copyright year update * Add Redis cache build, tests, and docs (#5916) * Updated handling for uint64 request priority * Ensure HPCX dependencies found in container (#5922) * Add HPCX dependencies to search path * Copy hpcx to CPU-only container * Add ucc path to CPU-only image * Fixed if statement * Fix df variable * Combine hpcx LD_LIBRARY_PATH * Add test case where MetricFamily is deleted before deleting Metric (#5915) * Add test case for metric lifetime error handling * Address comment * Use different MetricFamily name * Add testing for Pytorch instance group kind MODEL (#5810) * Add testing for Pytorch instance group kind MODEL * Remove unused item * Update testing to verify the infer result * Add copyright * Remove unused import * Update pip install * Update the model to use the same add sub logic * Add torch multi-gpu and multi-device models to L0_io * Fix up model version * Add test for sending instance update config via load API (#5937) * Add test for passing config via load api * Add more docs on instance update behavior * Update to suggested docs Co-authored-by: Ryan McCormick * Use dictionary for json config * Modify the config fetched from Triton instead --------- Co-authored-by: Ryan McCormick * Fix L0_batcher count check (#5939) * Add testing for json tensor format (#5914) * Add redis config and use local logfile for redis server (#5945) * Add redis config and use local logfile for redis server * Move redis log config to CLI * Have separate redis logs for unit tests and CLI tests * Add test on rate limiter max resource decrease update (#5885) * Add test on rate limiter max resource decrease update * Add test with explicit resource * Check server log for decreased resource limit * Add docs on decoupled final response feature (#5936) * Allow changing ping behavior based on env variable in SageMaker and entrypoint updates (#5910) * Allow changing ping behavior based on env variable in SageMaker * Add option for additional args * Make ping further configurable * Allow further configuration of grpc and http ports * Update docker/sagemaker/serve * Update docker/sagemaker/serve --------- Co-authored-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com> * Remove only MPI libraries in HPCX in L0_perf_analyzer (#5967) * Be more specific with MPI removal * Delete all libmpi libs * Ensure L0_batch_input requests received in order (#5963) * Add print statements for debugging * Add debugging print statements * Test using grpc client with stream to fix race * Use streaming client in all non-batch tests * Switch all clients to streaming GRPC * Remove unused imports, vars * Address comments * Remove random comment * Set inputs as separate function * Split set inputs based on test type * Add test for redis cache auth credentials via env vars (#5966) * Auto-formatting (#5979) * Auto-format * Change to clang-format-15 in CONTRIBTUING * Adding tests ensuring locale setting is passed to python backend interpreter * Refactor build.py CPU-only Linux libs for readability (#5990) * Improve the error message when the number of GPUs is insufficient (#5993) * Update README to include CPP-API Java Bindings (#5883) * Update env variable to use for overriding /ping behavior (#5994) * Add test that >1000 
model files can be loaded in S3 (#5976)
* Add test for >1000 files
* Capitalization for consistency
* Add bucket cleaning at end
* Move test pass/fail to end
* Check number of files in model dir at load time
* Add testing for GPU tensor error handling (#5871)
* Add testing for GPU tensor error handling
* Fix up
* Remove exit 0
* Fix jetson
* Fix up
* Add test for Python BLS model loading API (#5980)
* Add test for Python BLS model loading API
* Fix up
* Update README and versions for 23.06 branch
* Fix LD_LIBRARY_PATH for PyTorch backend
* Return updated df in add_cpu_libs
* Remove unneeded df param
* Update test failure messages to match Dataloader changes (#6006)
* Add dependency for L0_python_client_unit_tests (#6010)
* Improve performance tuning guide (#6026)
* Enabling nested spans for trace mode OpenTelemetry (#5928)
* Adding nested spans to OTel tracing + support of ensemble models
* Move multi-GPU dlpack test to a separate L0 test (#6001)
* Move multi-GPU dlpack test to a separate L0 test
* Fix copyright
* Fix up
* OpenVINO 2023.0.0 (#6031)
* Upgrade OV to 2023.0.0
* Upgrade OV model gen script to 2023.0.0
* Add test to check the output memory type for onnx models (#6033)
* Add test to check the output memory type for onnx models
* Remove unused import
* Address comment
* Add testing for implicit state for PyTorch backend (#6016)
* Add testing for implicit state for PyTorch backend
* Add testing for libtorch string implicit models
* Fix CodeQL
* Mention that libtorch backend supports implicit state
* Fix CodeQL
* Review edits
* Fix output tests for PyTorch backend
* Allow uncompressed conda execution environments (#6005)
Add test for uncompressed conda execution environments
* Fix implicit state test (#6039)
* Adding target_compile_features cxx_std_17 to tracing lib (#6040)
* Update 'main' to track development of 2.37.0 / 23.08
* Fix intermittent failure in L0_model_namespacing (#6052)
* Fix PyTorch implicit model mounting in gen_qa_model_repository (#6054)
* Fix broken links pointing to the `grpc_server.cc` file (#6068)
* Fix L0_backend_python expected instance name (#6073)
* Fix expected instance name
* Copyright year
* Fix L0_sdk: update the search name for the client wheel (#6074)
* Fix name of client wheel to be looked for
* Fix up
* Add GitHub action to format and lint code (#6022)
* Add pre-commit
* Fix typos, exec/shebang, formatting
* Remove clang-format
* Update contributing md to include pre-commit
* Update spacing in CONTRIBUTING
* Fix contributing pre-commit link
* Link to pre-commit install directions
* Wording
* Restore clang-format
* Fix yaml spacing
* Exclude templates folder for check-yaml
* Remove unused vars
* Normalize spacing
* Remove unused variable
* Normalize config indentation
* Update .clang-format to enforce max line length of 80
* Update copyrights
* Update copyrights
* Run workflows on every PR
* Fix copyright year
* Fix grammar
* Entrypoint.d files are not executable
* Run pre-commit hooks
* Mark not executable
* Run pre-commit hooks
* Remove unused variable
* Run pre-commit hooks after rebase
* Update copyrights
* Fix README.md typo (decoupled)
Co-authored-by: Ryan McCormick
* Run pre-commit hooks
* Grammar fix
Co-authored-by: Ryan McCormick
* Redundant word
Co-authored-by: Ryan McCormick
* Revert docker file changes
* Executable shebang revert
* Make model.py files non-executable
* Passin is proper flag
* Run pre-commit hooks on init_args/model.py
* Fix typo in init_args/model.py
* Make copyrights one line
---------
Co-authored-by: Ryan McCormick
* Fix default instance name change when count is 1 (#6088)
* Add test for sequence model instance update (#5831)
* Add test for sequence model instance update
* Add gap for file timestamp update
* Update test for non-blocking sequence update
* Update documentation
* Remove mentioning increase instance count case
* Add more documentation for scheduler update test
* Update test for non-blocking batcher removal
* Add polling due to async scheduler destruction
* Use _ as private
* Fix typo
* Add docs on instance count decrease
* Fix typo
* Separate direct and oldest to different test cases
* Separate nested tests in a loop into multiple test cases
* Refactor scheduler update test
* Improve doc on handling future test failures
* Address pre-commit
* Add best effort to reset model state after a single test case failure
* Remove reset model method to make harder for chaining multiple test cases as one
* Remove description on model state clean up
* Fix default instance name (#6097)
* Removing unused tests (#6085)
* Update post-23.07 release (#6103)
* Update README and versions for 2.36.0 / 23.07
* Update Dockerfile.win10.min
* Fix formatting issue
* fix formatting issue
* Fix whitespaces
* Fix whitespaces
* Fix whitespaces
* Improve asyncio testing (#6122)
* Reduce instance count to 1 for python bls model loading test (#6130)
* Reduce instance count to 1 for python bls model loading test
* Add comment when calling unload
* Fix queue test to expect exact number of failures (#6133)
* Fix queue test to expect exact number of failures
* Increase the execution time to more accurately capture requests
* Add CPU & GPU metrics in Grafana dashboard.json for K8s op prem deployment (fix #6047) (#6100)
Signed-off-by: Xiaodong Ye
* Adding the support tracing of child models invoked from a BLS model (#6063)
* Adding tests for bls
* Added fixme, cleaned previous commit
* Removed unused imports
* Fixing commit tree: Refactor code, so that OTel tracer provider is initialized only once
Added resource cmd option, testing
Added docs
* Clean up
* Update docs/user_guide/trace.md
Co-authored-by: Ryan McCormick
* Revision
* Update doc
* Clean up
* Added ostream exporter to OpenTelemetry for testing purposes; refactored trace tests
* Added opentelemetry trace collector set up to tests; refactored otel exporter tests to use OTel collector instead of netcat
* Revising according to comments
* Added comment regarding 'parent_span_id'
* Added permalink
* Adjusted test
---------
Co-authored-by: Ryan McCormick
* Test python environments 3.8-3.11 (#6109)
Add tests for python 3.8-3.11 for L0_python_backends
* Improve L0_backend_python debugging (#6157)
* Improve L0_backend_python debugging
* Use utils function for artifacts collection
* Add unreachable output test for reporting source of disconnectivity (#6149)
* Update 'main' to track development of 2.38.0 / 23.09 (#6163)
* Fix the versions in the doc (#6164)
* Update docs with NVAIE messaging (#6162)
Update docs with NVAIE messaging
* Add sanity tests for parallel instance loading (#6126)
* Remove extra whitespace (#6174)
* Remove a test case that sanity checks input value of --shape CLI flag (#6140)
* Remove test checking for --shape option
* Remove the entire test
* Add test when unload/load requests for same model is received at the same time (#6150)
* Add test when unload/load requests for same model received the same time
* Add test_same_model_overlapping_load_unload
* Use a load/unload stress test instead
* Pre-merge test name update
* Address pre-commit error
* Revert "Address pre-commit error"
This reverts commit 781cab1bfe816a3ffd5eaf23b01a7bfa38314bcd.
* Record number of occurrences of each exception
* Make assert failures clearer in L0_trt_plugin (#6166)
* Add end-to-end CI test for decoupled model support (#6131) (#6184)
* Add end-to-end CI test for decoupled model support
* Address feedback
* Test preserve_ordering for oldest strategy sequence batcher (#6185)
* added debugging guide (#5924)
* added debugging guide
* Run pre-commit
---------
Co-authored-by: David Yastremsky
* Add deadlock gdb section to debug guide (#6193)
* Fix character escape in model repository documentation (#6197)
* Fix docs test (#6192)
* Add utility functions for array manipulation (#6203)
* Add utility functions for outlier removal
* Fix functions
* Add newline to end of file
* Add gc collect to make sure gpu tensor is deallocated (#6205)
* Testing: add gc collect to make sure gpu tensor is deallocated
* Address comment
* Check for log error on failing to find explicit load model (#6204)
* Set default shm size to 1MB for Python backend (#6209)
* Trace Model Name Validation (#6199)
* Initial commit
* Cleanup using new standard formatting
* QA test restructuring
* Add newline to the end of test.sh
* HTTP/GRPC protocol changed to pivot on ready status & error status. Log file name changed in qa test.
* Fixing unhandled error memory leak
* Handle index function memory leak fix
* Fix the check for error message (#6226)
* Fix copyright for debugging guide (#6225)
* Add watts units to GPU power metric descriptions (#6242)
* Update post-23.08 release (#6234)
* CUDA 12.1 > 12.2
* DLIS-5208: onnxruntime+windows - stop treat warnings on compile as errors
* Revert "DLIS-5208: onnxruntime+windows - stop treat warnings on compile as errors"
This reverts commit 0cecbb7461fd944ff09f456011dfab960dff170e.
* Update Dockerfile.win10.min * Update Dockerfile.win10.min * Update README and versions for 23.08 branch * Update Dockerfile.win10 * Fix the versions in docs * Add the note about stabilization of the branch * Update docs with NVAIE messaging (#6162) (#6167) Update docs with NVAIE messaging Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com> * Resolve merge conflict --------- Co-authored-by: tanmayv25 Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com> * Add tests/docs for queue size (pending request count) metric (#6233) * Adding safe string to number conversions (#6173) * Added catch for out of range error for trace setting update * Added wrapper to safe parse options * Added option names to errors * Adjustments * Quick fix * Fixing option name for Windows * Removed repetitive code * Adjust getopt_long for Windows to use longindex * Moved try catch into ParseOption * Removed unused input * Improved names * Refactoring and clean up * Fixed Windows * Refactored getopt_long for Windows * Refactored trace test, pinned otel's collector version to avoid problems with go requirements * Test Python execute() to return Triton error code (#6228) * Add test for Python execute error code * Add all supported error codes into test * Move ErrorCode into TritonError * Expose ErrorCode internal in TritonError * Add docs on IPv6 (#6262) * Add test for TensorRT version-compatible model support (#6255) * Add tensorrt version-compatibility test * Generate one version-compatible model * Fix copyright year * Remove unnecessary variable * Remove unnecessary line * Generate TRT version-compatible model * Add sample inference to TRT version-compatible test * Clean up utils and model gen for new plan model * Fix startswith capitalization * Remove unused imports * Remove unused imports * Add log check * Upgrade protobuf version (#6268) * Add testing for retrieving shape and datatype in backend API (#6231) Add testing for retrieving output shape and datatype info from backend API * Update 'main' to track development of 2.39.0 / 23.10 (#6277) * Apply UCX workaround (#6254) * Add ensemble parameter forwarding test (#6284) * Exclude extra TRT version-compatible models from tests (#6294) * Exclude compatible models from tests. * Force model removal, in case it does not exist Co-authored-by: Ryan McCormick --------- Co-authored-by: Ryan McCormick * Adding installation of docker and docker-buildx (#6299) * Adding installation of docker and docker-buildx * remove whitespace * Use targetmodel from header as model name in SageMaker (#6147) * Use targetmodel from header as model name in SageMaker * Update naming for model hash * Add more error messages, return codes, and refactor HTTP server (#6297) * Fix typo (#6318) * Update the request re-use example (#6283) * Update the request re-use example * Review edit * Review comment * Disable developer tools build for In-process API + JavaCPP tests (#6296) * Add Python binding build. 
Add L0_python_api to test Python binding (#6319) * Add L0_python_api to test Python binding * Install Python API in CI image * Fix QA build * Increase network timeout for valgrind (#6324) * Tests and docs for ability to specify subdirectory to download for LocalizePath (#6308) * Added custom localization tests for s3 and azure, added docs * Refactor HandleInfer into more readable chunks (#6332) * Refactor model generation scripts (#6336) * Refactor model generation scripts * Fix codeql * Fix relative path import * Fix package structure * Copy the gen_common file * Add missing uint8 * Remove duplicate import * Add testing for scalar I/O in ORT backend (#6343) * Add testing for scalar I/O in ORT backend * Review edit * ci * Update post-23.09 release (#6367) * Update README and versions for 23.09 branch (#6280) * Update `Dockerfile` and `build.py` (#6281) * Update configuration for Windows Dockerfile (#6256) * Adding installation of docker and docker-buildx * Enable '--expt-relaxed-constexpr' flag for custom ops models * Upate Dockerfile version * Disable unit tests for Jetson * Update condition (#6285) * removing Whitespaces (#6293) * removing Whitespaces * removing whitespaces * Add security policy (#6376) * Adding client-side request cancellation support and testing (#6383) * Add L0_request_cancellation (#6252) * Add L0_request_cancellation * Remove unittest test * Add cancellation to gRPC server error handling * Fix up * Use identity model * Add tests for gRPC client-side cancellation (#6278) * Add tests for gRPC client-side cancellation * Fix CodeQL issues * Formatting * Update qa/L0_client_cancellation/client_cancellation_test.py Co-authored-by: Ryan McCormick * Move to L0_request_cancellation * Address review comments * Removing request cancellation support from asyncio version * Format * Update copyright * Remove tests * Handle cancellation notification in gRPC server (#6298) * Handle cancellation notification in gRPC server * Fix the request ptr initialization * Update src/grpc/infer_handler.h Co-authored-by: Ryan McCormick * Address review comment * Fix logs * Fix request complete callback by removing reference to state * Improve documentation --------- Co-authored-by: Ryan McCormick --------- Co-authored-by: Ryan McCormick * Fixes on the gRPC frontend to handle AsyncNotifyWhenDone() API (#6345) * Fix segmentation fault in gRPC frontend * Finalize all states upon completion * Fixes all state cleanups * Handle completed states when cancellation notification is received * Add more documentation steps * Retrieve dormant states to minimize the memory footprint for long streams * Update src/grpc/grpc_utils.h Co-authored-by: Ryan McCormick * Use a boolean state instead of raw pointer --------- Co-authored-by: Ryan McCormick * Add L0_grpc_state_cleanup test (#6353) * Add L0_grpc_state_cleanup test * Add model file in QA container * Fix spelling * Add remaining subtests * Add failing subtests * Format fixes * Fix model repo * Fix QA docker file * Remove checks for the error message when shutting down server * Fix spelling * Address review comments * Add schedulers request cancellation tests (#6309) * Add schedulers request cancellation tests * Merge gRPC client test * Reduce testing time and covers cancelling other requests as a consequence of request cancellation * Add streaming request cancellation test --------- Co-authored-by: Iman Tabrizian Co-authored-by: Ryan McCormick Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Add missing copyright (#6388) * Add basic generate 
endpoints for LLM tasks (#6366) * PoC of parsing request prompt and converting to Triton infer request * Remove extra trace * Add generate endpoint * Enable streaming version * Fix bug * Fix up * Add basic testing. Cherry pick from #6369 * format * Address comment. Fix build * Minor cleanup * cleanup syntax * Wrap error in SSE format * Fix up * Restrict number of response on non-streaming generate * Address comment on implementation. * Re-enable trace on generate endpoint * Add more comprehensive llm endpoint tests (#6377) * Add security policy (#6376) * Start adding some more comprehensive tests * Fix test case * Add response error testing * Complete test placeholder * Address comment * Address comments * Fix code check --------- Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: GuanLuo * Address comment * Address comment * Address comment * Fix typo --------- Co-authored-by: Ryan McCormick Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> * Add Python backend request cancellation test (#6364) * Add cancelled response status test * Add Python backend request cancellation test * Add Python backend decoupled request cancellation test * Simplified response if cancelled * Test response_sender.send() after closed * Rollback test response_sender.send() after closed * Rollback non-decoupled any response on cancel * Add TRT-LLM backend build to Triton (#6365) (#6392) * Add TRT-LLM backend build to Triton (#6365) * Add trtllm backend to build * Temporarily adding version map for 23.07 * Fix build issue * Update comment * Comment out python binding changes * Add post build * Update trtllm backend naming * Update TRTLLM base image * Fix cmake arch * Revert temp changes for python binding PR * Address comment * Move import to the top (#6395) * Move import to the top * pre commit format * Add Python backend when vLLM backend built (#6397) * Update build.py to build vLLM backend (#6394) * Support parameters object in generate route * Update 'main' to track development of 2.40.0 / 23.11 (#6400) * Fix L0_sdk (#6387) * Add documentation on request cancellation (#6403) * Add documentation on request cancellation * Include python backend * Update docs/user_guide/request_cancellation.md Co-authored-by: Iman Tabrizian * Update docs/user_guide/request_cancellation.md Co-authored-by: Neelay Shah * Update docs/README.md Co-authored-by: Neelay Shah * Update docs/user_guide/request_cancellation.md Co-authored-by: Ryan McCormick * Remove inflight term from the main documentation * Address review comments * Fix * Update docs/user_guide/request_cancellation.md Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Fix --------- Co-authored-by: Iman Tabrizian Co-authored-by: Neelay Shah Co-authored-by: Ryan McCormick Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Fixes in request cancellation doc (#6409) * Document generate HTTP endpoint (#6412) * Document generate HTTP endpoint * Address comment * Fix up * format * Address comment * Update SECURITY.md to not display commented copyright (#6426) * Fix missing library in L0_data_compression (#6424) * Fix missing library in L0_data_compression * Fix up * Add Javacpp-presets repo location as env variable in Java tests(#6385) Simplify testing when upstream (javacpp-presets) build changes. 
Related to triton-inference-server/client#409 * TRT-LLM backend build changes (#6406) * Update url * Debugging * Debugging * Update url * Fix build for TRT-LLM backend * Remove TRTLLM TRT and CUDA versions * Fix up unused var * Fix up dir name * FIx cmake patch * Remove previous TRT version * Install required packages for example models * Remove packages that are only needed for testing * Add gRPC AsyncIO request cancellation tests (#6408) * Fix gRPC test failure and refactor * Add gRPC AsyncIO cancellation tests * Better check if a request is cancelled * Use f-string * Fix L0_implicit_state (#6427) * Fixing vllm build (#6433) * Fixing torch version for vllm * Switch Jetson model TensorRT models generation to container (#6378) * Switch Jetson model TensorRT models generation to container * Adding missed file * Fix typo * Fix typos * Remove extra spaces * Fix typo * Bumped vllm version (#6444) * Adjust test_concurrent_same_model_load_unload_stress (#6436) * Adding emergency vllm latest release (#6454) * Fix notify state destruction and inflight states tracking (#6451) * Ensure notify_state_ gets properly destructed * Fix inflight state tracking to properly erase states * Prevent removing the notify_state from being erased * Wrap notify_state_ object within unique_ptr * Update TRT-LLM backend url (#6455) * TRTLLM backend post release * TRTLLM backend post release * Update submodule url for permission issue * Update submodule url * Fix up * Not using postbuild function to workaround submodule url permission issue * Added docs on python based backends (#6429) Co-authored-by: Neelay Shah * L0_model_config Fix (#6472) * Minor fix for L0_model_config * Add test for Python model parameters (#6452) * Test Python BLS with different sizes of CUDA memory pool (#6276) * Test with different sizes of CUDA memory pool * Check the server log for error message * Improve debugging * Fix syntax * Add documentation for K8s-onprem StartupProbe (#5257) Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: Ryan McCormick * Update `main` post-23.10 release (#6484) * Update README and versions for 23.10 branch (#6399) * Cherry-picking vLLM backend changes (#6404) * Update build.py to build vLLM backend (#6394) * Add Python backend when vLLM backend built (#6397) --------- Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> * Add documentation on request cancellation (#6403) (#6407) * Add documentation on request cancellation * Include python backend * Update docs/user_guide/request_cancellation.md * Update docs/user_guide/request_cancellation.md * Update docs/README.md * Update docs/user_guide/request_cancellation.md * Remove inflight term from the main documentation * Address review comments * Fix * Update docs/user_guide/request_cancellation.md * Fix --------- Co-authored-by: Iman Tabrizian Co-authored-by: Neelay Shah Co-authored-by: Ryan McCormick Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Fixes in request cancellation doc (#6409) (#6410) * TRT-LLM backend build changes (#6406) (#6430) * Update url * Debugging * Debugging * Update url * Fix build for TRT-LLM backend * Remove TRTLLM TRT and CUDA versions * Fix up unused var * Fix up dir name * FIx cmake patch * Remove previous TRT version * Install required packages for example models * Remove packages that are only needed for testing * Fixing vllm build (#6433) (#6437) * Fixing torch version for vllm Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> 
* Update TRT-LLM backend url (#6455) (#6460) * TRTLLM backend post release * TRTLLM backend post release * Update submodule url for permission issue * Update submodule url * Fix up * Not using postbuild function to workaround submodule url permission issue * remove redundant lines * Revert "remove redundant lines" This reverts commit 86be7ad969b484e5b55a3c0541d21eee7a06d889. * restore missed lines * Update build.py Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> * Update build.py Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> --------- Co-authored-by: Tanmay Verma Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: Iman Tabrizian Co-authored-by: Neelay Shah Co-authored-by: Ryan McCormick Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> Co-authored-by: Kris Hung Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com> Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> * Adding structure reference to the new document (#6493) * Improve L0_backend_python test stability (ensemble / gpu_tensor_lifecycle) (#6490) * Test torch allocator gpu memory usage directly rather than global gpu memory for more consistency * Add L0_generative_sequence test (#6475) * Add testing backend and test * Add test to build / CI. Minor fix on L0_http * Format. Update backend documentation * Fix up * Address comment * Add negative testing * Fix up * Downgrade vcpkg version (#6503) * Collecting sub dir artifacts in GitLab yaml. Removing collect function from test script. (#6499) * Use post build function for TRT-LLM backend (#6476) * Use postbuild function * Remove updating submodule url * Enhanced python_backend autocomplete (#6504) * Added testing for python_backend autocomplete: optional input and model_transaction_policy * Parse reuse-grpc-port and reuse-http-port as booleans (#6511) Co-authored-by: Francesco Petrini * Fixing L0_io (#6510) * Fixing L0_io * Add Python-based backends CI (#6466) * Bumped vllm version * Add python-bsed backends testing * Add python-based backends CI * Fix errors * Add vllm backend * Fix pre-commit * Modify test.sh * Remove vllm_opt qa model * Remove vLLM ackend tests * Resolve review comments * Fix pre-commit errors * Update qa/L0_backend_python/python_based_backends/python_based_backends_test.py Co-authored-by: Tanmay Verma * Remove collect_artifacts_from_subdir function call --------- Co-authored-by: oandreeva-nv Co-authored-by: Tanmay Verma * Enabling option to restrict access to HTTP APIs based on header value pairs (similar to gRPC) * Upgrade DCGM from 2.4.7 to 3.2.6 (#6515) * Enhance GCS credentials documentations (#6526) * Test file override outside of model directory (#6516) * Add boost-filesystem * Update ORT version to 1.16.2 (#6531) * Adjusting expected error msg (#6517) * Update 'main' to track development of 2.41.0 / 23.12 (#6543) * Enhance testing for pending request count (#6532) * Enhance testing for pending request count * Improve the documentation * Add more documentation * Add testing for Python backend request rescheduling (#6509) * Add testing * Fix up * Enhance testing * Fix up * Revert test changes * Add grpc endpoint test * Remove unused import * Remove unused import * Update qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py Co-authored-by: Iman Tabrizian * Update qa/python_models/bls_request_rescheduling/model.py Co-authored-by: Iman Tabrizian --------- Co-authored-by: Iman 
Tabrizian * Check that the wget is installed (#6556) * secure deployment considerations guide (#6533) * draft document * updates * updates * updated * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * update * updates * updates * Update docs/customization_guide/deploy.md Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com> * Update docs/customization_guide/deploy.md Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com> * fixing typos * updated with clearer warnings * updates to readme and toc --------- Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com> * Fix typo and change the command line order (#6557) * Fix typo and change the command line order * Improve visual experience. Add 'clang' package * Add error during rescheduling test to L0_generative_sequence (#6550) * changing references to concrete instances * Add testing for implicit state enhancements (#6524) * Add testing for single buffer * Add testing for implicit state with buffer growth * Improve testing * Fix up * Add CUDA virtual address size flag * Add missing test files * Parameter rename * Test fixes * Only build implicit state backend for GPU=ON * Fix copyright (#6584) * Mention TRT LLM backend supports request cancellation (#6585) * update model repository generation for onnx models for protobuf (#6575) * Fix L0_sagemaker (#6587) * Add C++ server wrapper to the doc (#6592) * Add timeout to client apis and tests (#6546) Client PR: triton-inference-server/client#429 * Change name generative -> iterative (#6601) * name changes * updated names * Add documentation on generative sequence (#6595) * Add documentation on generative sequence * Address comment * Reflect the "iterative" change * Updated description of iterative sequences * Restricted HTTP API documentation Co-authored-by: Ryan McCormick * Add request cancellation and debugging guide to generated docs (#6617) * Support for http request cancellation. Includes fix for seg fault in generate_stream endpoint. * Bumped vLLM version to v0.2.2 (#6623) * Upgrade ORT version (#6618) * Use compliant preprocessor (#6626) * Update README.md (#6627) * Extend request objects lifetime and fixes possible segmentation fault (#6620) * Extend request objects lifetime * Remove explicit TRITONSERVER_InferenceRequestDelete * Format fix * Include the inference_request_ initialization to cover RequestNew --------- Co-authored-by: Neelay Shah * Update protobuf after python update for testing (#6638) This fixes the issue where python client has `AttributeError: 'NoneType' object has no attribute 'enum_types_by_name' errors after python version is updated. * Update post-23.11 release (#6653) * Update README and versions for 2.40.0 / 23.11 (#6544) * Removing path construction to use SymLink alternatives * Update version for PyTorch * Update windows Dockerfile configuration * Update triton version to 23.11 * Update README and versions for 2.40.0 / 23.11 * Fix typo * Ading 'ldconfig' to configure dynamic linking in container (#6602) * Point to tekit_backend (#6616) * Point to tekit_backend * Update version * Revert tekit changes (#6640) --------- Co-authored-by: Kris Hung * PYBE Timeout Tests (#6483) * New testing to confirm large request timeout values can be passed and retrieved within Python BLS models. 
* Add note on lack of ensemble support (#6648) * Added request id to span attributes (#6667) * Add test for optional internal tensor within an ensemble (#6663) * Add test for optional internal tensor within an ensemble * Fix up * Set CMake version to 3.27.7 (#6675) * Set CMake version to 3.27.7 * Set CMake version to 3.27.7 * Fix double slash typo * restore typo (#6680) * Update 'main' to track development of 2.42.0 / 24.01 (#6673) * iGPU build refactor (#6684) (#6691) * Mlflow Plugin Fix (#6685) * Mlflow plugin fix * Fix extra content-type headers in HTTP server (#6678) * Fix iGPU CMakeFile tags (#6695) * Unify iGPU test build with x86 ARM * adding TRITON_IGPU_BUILD to core build definition; adding logic to skip caffe2plan test if TRITON_IGPU_BUILD=1 * re-organizing some copies in Dockerfile.QA to fix igpu devel build * Pre-commit fix --------- Co-authored-by: kyle * adding default value for TRITON_IGPU_BUILD=OFF (#6705) * adding default value for TRITON_IGPU_BUILD=OFF * fix newline --------- Co-authored-by: kyle * Add test case for decoupled model raising exception (#6686) * Add test case for decoupled model raising exception * Remove unused import * Address comment * Escape special characters in general docs (#6697) * vLLM Benchmarking Test (#6631) * vLLM Benchmarking Test * Allow configuring GRPC max connection age and max connection age grace (#6639) * Add ability to configure GRPC max connection age and max connection age grace * Allow pass GRPC connection age args when they are set from command ---------- Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com> --------- Signed-off-by: Xiaodong Ye Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> Co-authored-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com> Co-authored-by: Neelay Shah Co-authored-by: Tanmay Verma Co-authored-by: Kris Hung Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> Co-authored-by: Ryan McCormick Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com> Co-authored-by: Iman Tabrizian Co-authored-by: Gerard Casas Saez Co-authored-by: Misha Chornyi <99709299+mc-nv@users.noreply.github.com> Co-authored-by: R0CKSTAR Co-authored-by: Elias Bermudez <6505145+debermudez@users.noreply.github.com> Co-authored-by: ax-vivien <113907557+ax-vivien@users.noreply.github.com> Co-authored-by: Neelay Shah Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com> Co-authored-by: Matthew Kotila Co-authored-by: Nikhil Kulkarni Co-authored-by: Misha Chornyi Co-authored-by: Iman Tabrizian Co-authored-by: David Yastremsky Co-authored-by: Timothy Gerdes <50968584+tgerdesnv@users.noreply.github.com> Co-authored-by: Mate Mijolović Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com> Co-authored-by: Hyunjae Woo <107147848+nv-hwoo@users.noreply.github.com> Co-authored-by: Tanay Varshney Co-authored-by: Francesco Petrini Co-authored-by: Dmitry Mironov Co-authored-by: Ryan McCormick Co-authored-by: Sai Kiran Polisetty Co-authored-by: oandreeva-nv Co-authored-by: kyle Co-authored-by: Neal Vaidya Co-authored-by: siweili11 <152239970+siweili11@users.noreply.github.com> --- .clang-format | 4 +- .github/workflows/codeql.yml | 84 + .github/workflows/pre-commit.yaml | 39 + .gitignore | 5 + .pre-commit-config.yaml | 74 + CITATION.cff | 7 + CMakeLists.txt | 67 +- CONTRIBUTING.md | 34 +- Dockerfile.QA | 92 +- Dockerfile.sdk | 53 +- 
Dockerfile.win10.min | 161 +- LICENSE | 2 +- README.md | 178 +- SECURITY.md | 44 + TRITON_VERSION | 2 +- build.py | 2598 +++++---- compose.py | 437 +- deploy/alibaba-cloud/README.md | 10 +- deploy/aws/README.md | 32 +- deploy/aws/templates/deployment.yaml | 6 +- deploy/aws/values.yaml | 4 +- deploy/fleetcommand/Chart.yaml | 13 +- deploy/fleetcommand/README.md | 10 +- deploy/fleetcommand/templates/deployment.yaml | 2 + deploy/fleetcommand/templates/secrets.yaml | 2 + deploy/fleetcommand/values.yaml | 8 +- deploy/gcp/README.md | 28 +- deploy/gcp/values.yaml | 4 +- deploy/gke-marketplace-app/README.md | 93 +- .../gke-marketplace-app/benchmark/README.md | 17 +- .../model-store/bert_base_tf_gpu/config.pbtxt | 4 +- .../bert_base_trt_gpu/config.pbtxt | 4 +- .../bert_distill_tf_cpu/config.pbtxt | 4 +- .../bert_distill_tf_gpu/config.pbtxt | 4 +- .../perf-analyzer-script/perf_query.sh | 0 .../perf-analyzer-script/triton_client.yaml | 4 +- .../client-sample/bert_request.json | 6 +- ...tfile_bert_large.py => locustfile_bert.py} | 17 +- .../client-sample/perf_analyzer_grpc.sh | 2 +- .../server-deployer/build_and_push.sh | 9 +- .../server-deployer/chart/triton/Chart.yaml | 6 +- .../chart/triton/templates/application.yaml | 16 +- .../chart/triton/templates/deployment.yaml | 4 +- .../chart/triton/templates/hpa.yaml | 10 +- .../chart/triton/templates/ingress.yaml | 48 + .../chart/triton/templates/service.yaml | 6 +- .../server-deployer/chart/triton/values.yaml | 10 +- .../server-deployer/data-test/schema.yaml | 26 +- .../server-deployer/schema.yaml | 26 +- .../gke-marketplace-app/trt-engine/README.md | 63 + deploy/k8s-onprem/README.md | 38 +- deploy/k8s-onprem/dashboard.json | 690 ++- deploy/k8s-onprem/templates/deployment.yaml | 13 + deploy/k8s-onprem/values.yaml | 2 +- deploy/mlflow-triton-plugin/README.md | 13 +- .../onnx_float32_int32_int32/config.pbtxt | 0 .../mlflow_triton/__init__.py | 6 +- .../mlflow_triton/config.py | 115 +- .../mlflow_triton/deployments.py | 359 +- .../scripts/publish_model_to_mlflow.py | 22 +- .../scripts/triton_flavor.py | 16 +- deploy/mlflow-triton-plugin/setup.py | 10 +- docker/cpu_only/entrypoint.d/12-banner.sh | 0 .../entrypoint.d/50-gpu-driver-check2.sh | 0 .../entrypoint.d/15-container-copyright.txt | 2 +- docker/entrypoint.d/50-gpu-driver-check2.sh | 0 docker/entrypoint.d/99-check-run-aip-mode.sh | 0 docker/sagemaker/serve | 71 +- docs/.gitignore | 1 + docs/Dockerfile.docs | 54 + docs/Makefile | 53 + docs/README.md | 277 +- docs/_static/.gitattributes | 2 + docs/_static/custom.css | 319 ++ docs/_static/logo_2color_horizontal.svg | 2 + docs/_static/logo_2color_vertical.svg | 2 + .../nvidia-logo-horiz-rgb-blk-for-screen.png | 3 + .../nvidia-logo-vert-rgb-blk-for-screen.png | 3 + docs/_static/rtd-data.js | 36 + docs/_templates/layout.html | 31 + docs/conf.py | 256 + docs/contents.md | 104 + docs/{ => customization_guide}/build.md | 70 +- docs/{ => customization_guide}/compose.md | 18 +- docs/customization_guide/deploy.md | 279 + .../inference_protocols.md | 187 +- .../repository_agents.md | 2 +- docs/{ => customization_guide}/test.md | 6 +- docs/examples/README.md | 35 + docs/examples/jetson/README.md | 14 +- .../concurrency_and_dynamic_batching/Makefile | 6 +- .../README.md | 24 +- .../concurrency_and_dynamic_batching/common.h | 1 + .../people_detection.cc | 7 +- .../tao/convert_peoplenet.sh | 0 .../simple_identity/config.pbtxt | 0 docs/{ => getting_started}/quickstart.md | 42 +- docs/index.md | 106 + docs/metrics.md | 143 - docs/perf_analyzer.md | 667 --- 
docs/protocol/README.md | 63 +- docs/protocol/extension_binary_data.md | 4 +- docs/protocol/extension_classification.md | 6 +- docs/protocol/extension_generate.md | 188 + docs/protocol/extension_logging.md | 198 + .../protocol/extension_model_configuration.md | 12 +- docs/protocol/extension_model_repository.md | 54 +- docs/protocol/extension_parameters.md | 104 + docs/protocol/extension_schedule_policy.md | 33 +- docs/protocol/extension_sequence.md | 6 +- docs/protocol/extension_shared_memory.md | 35 +- docs/protocol/extension_statistics.md | 74 +- docs/protocol/extension_trace.md | 36 +- docs/response_cache.md | 87 - docs/trace.md | 305 -- docs/{ => user_guide}/architecture.md | 25 +- docs/{ => user_guide}/custom_operations.md | 47 +- docs/user_guide/debugging_guide.md | 151 + docs/{ => user_guide}/decoupled_models.md | 45 +- docs/{ => user_guide}/faq.md | 72 +- docs/{ => user_guide}/images/arch.jpg | Bin .../images/dyna_sequence_example0.png | Bin .../images/dyna_sequence_example1.png | Bin .../images/ensemble_example0.png | Bin .../images/multi_model_exec.png | Bin .../images/multi_model_parallel_exec.png | Bin .../images/multi_model_serial_exec.png | Bin .../images/sequence_example0.png | Bin .../images/sequence_example1.png | Bin .../images/sequence_example2.png | Bin .../images/triton_on_jetson.png | Bin docs/{ => user_guide}/jetson.md | 44 +- docs/user_guide/metrics.md | 345 ++ docs/user_guide/model_analyzer.md | 45 + docs/{ => user_guide}/model_configuration.md | 287 +- docs/{ => user_guide}/model_management.md | 91 +- docs/{ => user_guide}/model_repository.md | 152 +- docs/{ => user_guide}/optimization.md | 71 +- docs/user_guide/perf_analyzer.md | 30 + docs/user_guide/performance_tuning.md | 393 ++ docs/{ => user_guide}/ragged_batching.md | 3 +- docs/{ => user_guide}/rate_limiter.md | 4 +- docs/user_guide/request_cancellation.md | 102 + docs/user_guide/response_cache.md | 243 + docs/user_guide/trace.md | 539 ++ docs/{ => user_guide}/v1_to_v2.md | 4 +- pyproject.toml | 51 + qa/L0_async_work_queue/test.sh | 0 qa/L0_backend_bls/test.sh | 15 +- qa/L0_backend_config/test.sh | 127 +- qa/L0_backend_fastertransformer/test.sh | 83 + qa/L0_backend_identity/identity_test.py | 235 +- qa/L0_backend_identity/test.sh | 2 +- qa/L0_backend_output_detail/test.sh | 69 + .../models/argument_validation/1/model.py | 110 +- .../argument_validation/test.sh | 7 +- .../bls/bls_parameters_test.py | 71 + qa/L0_backend_python/bls/test.sh | 346 +- qa/L0_backend_python/common.sh | 34 +- qa/L0_backend_python/custom_metrics/test.sh | 85 + .../decoupled/decoupled_test.py | 215 +- .../decoupled/models/decoupled_bls/1/model.py | 175 +- .../models/decoupled_bls_stream/1/model.py | 132 + .../models/decoupled_bls_stream/config.pbtxt | 54 + .../models/decoupled_execute_error/1/model.py | 52 +- .../decoupled_raise_exception/1/model.py | 35 + .../decoupled_raise_exception/config.pbtxt | 55 + .../1/model.py | 47 +- .../1/model.py | 46 +- qa/L0_backend_python/decoupled/test.sh | 51 +- .../ensemble/ensemble_test.py | 68 +- qa/L0_backend_python/ensemble/test.sh | 13 +- qa/L0_backend_python/env/test.sh | 198 +- qa/L0_backend_python/examples/test.sh | 205 +- qa/L0_backend_python/io/io_test.py | 140 +- qa/L0_backend_python/io/test.sh | 82 +- .../lifecycle/lifecycle_test.py | 156 +- qa/L0_backend_python/lifecycle/test.sh | 22 +- qa/L0_backend_python/logging/logging_test.py | 58 + qa/L0_backend_python/logging/test.sh | 231 + .../model_control/model_control_test.py | 23 +- qa/L0_backend_python/model_control/test.sh | 8 +- 
.../python_based_backends_test.py | 144 + .../python_based_backends/test.sh | 113 + qa/L0_backend_python/python_test.py | 388 +- qa/L0_backend_python/python_unittest.py | 55 +- .../grpc_endpoint_test.py | 111 + .../request_rescheduling/test.sh | 116 + .../restart/models/restart/1/model.py | 21 +- qa/L0_backend_python/restart/restart_test.py | 24 +- qa/L0_backend_python/restart/test.sh | 7 +- .../setup_python_enviroment.sh | 171 + qa/L0_backend_python/test.sh | 171 +- qa/L0_backend_python/variants/test.sh | 2 +- qa/L0_backend_tutorial/test.sh | 23 +- qa/L0_batch_custom/batch_custom_test.py | 273 + qa/L0_batch_custom/test.sh | 192 + qa/L0_batch_input/batch_input_test.py | 272 +- qa/L0_batch_input/test.sh | 8 +- qa/L0_batcher/batcher_test.py | 1362 +++-- qa/L0_batcher/test.sh | 92 +- qa/L0_batcher/verify_timestamps.py | 45 +- .../buffer_attributes_test.py | 65 +- qa/L0_buffer_attributes/models/bls/1/model.py | 19 +- .../models/identity/1/model.py | 7 +- qa/L0_buffer_attributes/test.sh | 3 +- qa/L0_client_build_variants/test.sh | 37 +- qa/L0_client_java/test.sh | 0 .../client_memory_mail.py | 13 +- .../models/custom_identity_int32/config.pbtxt | 2 +- qa/L0_client_memory_growth/test.sh | 46 +- qa/L0_client_nobatch/client_test.py | 205 +- ...t_test.py => client_infer_timeout_test.py} | 177 +- .../client_non_infer_timeout_test.py | 340 ++ .../models/custom_identity_int32/config.pbtxt | 2 +- qa/L0_client_timeout/test.sh | 80 +- .../models/custom_identity_int32/config.pbtxt | 2 +- qa/L0_client_valgrind/test.sh | 4 +- qa/L0_cmdline_trace/test.sh | 134 +- qa/L0_cmdline_trace/trace_client.py | 79 + qa/L0_config_json/max_priority_level.pbtxt | 62 + qa/L0_config_json/test.sh | 48 +- qa/L0_cuda_graph/test.sh | 51 +- qa/L0_cuda_graph/trt_cuda_graph_test.py | 85 +- .../cuda_shared_memory_test.py | 138 +- qa/L0_cuda_shared_memory/test.sh | 2 +- qa/L0_custom_ops/cuda_op_test.py | 66 +- qa/L0_custom_ops/mod_op_test.py | 77 +- qa/L0_custom_ops/onnx_op_test.py | 74 +- qa/L0_custom_ops/test.sh | 66 +- qa/L0_custom_ops/vision_op_test.py | 74 +- qa/L0_custom_ops/zero_out_test.py | 64 +- qa/L0_data_compression/test.sh | 7 +- qa/L0_data_compression/validation.py | 12 +- qa/L0_decoupled/decoupled_test.py | 530 +- qa/L0_decoupled/test.sh | 16 +- qa/L0_device_memory_tracker/test.py | 109 + qa/L0_device_memory_tracker/test.sh | 128 + .../unittest => L0_dlpack_multi_gpu}/test.sh | 21 +- qa/L0_doc_links/mkdocs.yml | 44 + qa/L0_doc_links/test.sh | 76 + qa/L0_dyna_implicit_state/test.sh | 15 +- .../dyna_sequence_batcher_test.py | 1016 ++-- qa/L0_dyna_sequence_batcher/test.sh | 16 +- .../client_plugin_test/1/model.py | 63 + .../client_plugin_test/config.pbtxt | 33 +- qa/L0_grpc/grpc_basic_auth_test.py | 66 + qa/L0_grpc/grpc_client_plugin_test.py | 120 + qa/L0_grpc/nginx.conf | 54 + qa/L0_grpc/python_grpc_aio_test.py | 125 + qa/L0_grpc/python_unit_test.py | 159 + qa/L0_grpc/test.sh | 175 +- qa/L0_grpc_state_cleanup/cleanup_test.py | 560 ++ qa/L0_grpc_state_cleanup/test.sh | 194 + qa/L0_http/generate_endpoint_test.py | 419 ++ .../generate_models/mock_llm/1/model.py | 107 + .../generate_models/mock_llm/config.pbtxt | 60 + qa/L0_http/http_basic_auth_test.py | 66 + qa/L0_http/http_client_plugin_test.py | 175 + qa/L0_http/http_restricted_api_test.py | 94 + qa/L0_http/http_test.py | 125 +- qa/L0_http/nginx.conf | 57 + qa/L0_http/python_http_aio_test.py | 116 + qa/L0_http/test.sh | 189 +- qa/L0_http_fuzz/fuzztest.py | 56 +- qa/L0_http_fuzz/test.sh | 16 +- qa/L0_https/test.sh | 29 +- qa/L0_implicit_state/implicit_state.py | 229 +- 
.../models/growable_memory/config.pbtxt | 103 + .../models/single_state_buffer/config.pbtxt | 97 + qa/L0_implicit_state/test.sh | 26 +- qa/L0_infer/infer_test.py | 1229 +++-- qa/L0_infer/install_and_test.sh | 2 +- qa/L0_infer/test.sh | 182 +- qa/L0_infer_reshape/infer_reshape_test.py | 257 +- qa/L0_infer_reshape/test.sh | 2 +- qa/L0_infer_variable/infer_variable_test.py | 453 +- qa/L0_infer_zero/infer_zero_test.py | 337 +- qa/L0_infer_zero/test.sh | 4 + qa/L0_inferentia_perf_analyzer/test.sh | 34 +- qa/L0_io/test.sh | 63 +- .../iterative_sequence_e2e.py | 192 + .../models/iterative_sequence/config.pbtxt | 48 + qa/L0_iterative_sequence/test.sh | 92 + .../MemoryGrowthTest.java | 1570 +++--- qa/L0_java_memory_growth/test.sh | 16 +- qa/L0_java_resnet/ResnetTest.java | 1039 ++-- qa/L0_java_resnet/test.sh | 12 +- qa/L0_java_sequence_batcher/SequenceTest.java | 1083 ++-- qa/L0_java_sequence_batcher/test.sh | 12 +- qa/L0_java_simple_example/test.sh | 12 +- qa/L0_json/test.sh | 44 + qa/L0_large_payload/large_payload_test.py | 103 +- qa/L0_large_payload/test.sh | 0 qa/L0_libtorch_inference_mode/test.sh | 4 +- .../client.py | 90 + .../gen_models.py | 90 + .../models/libtorch_multi_device/config.pbtxt | 60 + .../test.sh | 149 + qa/L0_libtorch_io_names/io_names_client.py | 52 +- qa/L0_libtorch_io_names/test.sh | 0 qa/L0_libtorch_io_types/test.sh | 131 + qa/L0_libtorch_nvfuser/test.sh | 3 +- qa/L0_libtorch_optimized_execution/test.sh | 0 .../libtorch_shared_weights_test.py | 25 +- qa/L0_libtorch_shared_weights/test.sh | 3 +- qa/L0_lifecycle/lifecycle_test.py | 2899 +++++++---- qa/L0_lifecycle/test.sh | 847 +-- qa/L0_logging/logging_endpoint_test.py | 405 ++ qa/L0_logging/test.sh | 595 +++ qa/L0_long_running_stress/crashing_client.py | 61 +- qa/L0_long_running_stress/scenarios.py | 654 +-- qa/L0_long_running_stress/stress.py | 530 +- qa/L0_long_running_stress/stress_mail.py | 28 +- qa/L0_long_running_stress/test.sh | 26 +- qa/L0_memory/test.sh | 0 qa/L0_memory_growth/busy_op_test.py | 84 +- qa/L0_memory_growth/server_memory_mail.py | 23 +- qa/L0_memory_growth/test.sh | 71 +- qa/L0_metrics/ensemble_delay/config.pbtxt | 67 + qa/L0_metrics/identity_delay/config.pbtxt | 58 + qa/L0_metrics/metrics_config_test.py | 134 + qa/L0_metrics/metrics_queue_size_test.py | 306 ++ qa/L0_metrics/test.sh | 240 +- .../identity_cache_off/config.pbtxt | 46 + .../identity_cache_on/config.pbtxt | 46 + qa/L0_mlflow/plugin_test.py | 55 +- qa/L0_mlflow/test.sh | 109 +- .../common/no_version/expected | 2 +- .../custom/no_delimiter/config.pbtxt | 0 .../custom/no_delimiter/expected | 1 + .../unknown_backend.unknown/config.pbtxt | 0 .../custom/unknown_backend.unknown/expected | 2 + .../invalid_input_map/config.pbtxt | 2 +- .../ensemble/non_existing_model/expected | 2 +- .../unreachable_output_3/config.pbtxt | 94 + .../ensemble/unreachable_output_3/expected | 1 + .../openvino/bad_input_dims/config.pbtxt | 12 + .../openvino/bad_input_dims/expected | 1 + .../openvino/bad_output_dims/config.pbtxt | 12 + .../openvino/bad_output_dims/expected | 1 + .../openvino/too_few_inputs/config.pbtxt | 6 + .../openvino/too_few_inputs/expected | 1 + .../openvino/too_many_inputs/config.pbtxt | 18 + .../openvino/too_many_inputs/expected | 1 + .../openvino/unknown_input/config.pbtxt | 24 + .../openvino/unknown_input/expected | 1 + .../openvino/unknown_output/config.pbtxt | 18 + .../openvino/unknown_output/expected | 1 + .../conflicting_max_batch_size/model.py | 15 +- .../conflicting_scheduler_sequence/model.py | 15 +- 
.../python/input_missing_datatype/model.py | 15 +- .../python/input_missing_dims/model.py | 15 +- .../python/input_missing_name/model.py | 15 +- .../python/input_wrong_property/expected | 2 +- .../python/input_wrong_property/model.py | 20 +- .../config.pbtxt | 24 + .../expected | 1 + .../model.py | 47 + .../config.pbtxt | 28 + .../expected | 1 + .../model.py | 46 + .../python/no_return/model.py | 15 +- .../python/output_missing_datatype/model.py | 15 +- .../python/output_missing_dims/model.py | 15 +- .../python/output_missing_name/model.py | 15 +- .../python/output_wrong_property/model.py | 20 +- .../1/model.savedmodel/saved_model.pb | Bin .../bad_input_dims/config.pbtxt | 0 .../bad_input_dims/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../bad_input_type/config.pbtxt | 0 .../bad_input_type/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../bad_output_dims/config.pbtxt | 0 .../bad_output_dims/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../bad_output_type/config.pbtxt | 0 .../bad_output_type/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../too_many_inputs/config.pbtxt | 2 +- .../too_many_inputs/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../unknown_input/config.pbtxt | 0 .../unknown_input/expected | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../unknown_output/config.pbtxt | 0 .../unknown_output/expected | 1 + .../tensorrt/bad_dynamic_shapes_max/expected | 2 +- .../tensorrt/bad_dynamic_shapes_min/expected | 2 +- .../custom/empty_config.identity/config.pbtxt | 0 .../custom/empty_config.identity/expected | 22 + .../custom/no_backend.identity/config.pbtxt | 15 + .../custom/no_backend.identity/expected | 33 + .../onnx/cpu_instance/config.pbtxt | 0 .../onnx/empty_config/expected | 1 + .../onnx/empty_config/expected.1 | 1 + .../onnx/empty_config/expected.2 | 1 + .../onnx/empty_config/expected.3 | 1 + .../onnx/no_config/expected | 1 + .../onnx/no_config/expected.1 | 1 + .../onnx/no_config/expected.2 | 1 + .../onnx/no_config/expected.3 | 1 + .../openvino/dynamic_batch/config.pbtxt | 0 .../openvino/dynamic_batch/expected | 45 + .../openvino/dynamic_batch/expected.1 | 45 + .../openvino/dynamic_batch/expected.2 | 45 + .../openvino/dynamic_batch/expected.3 | 45 + .../openvino/empty_config/config.pbtxt | 0 .../openvino/empty_config/expected | 45 + .../openvino/empty_config/expected.1 | 45 + .../openvino/empty_config/expected.2 | 45 + .../openvino/empty_config/expected.3 | 45 + .../openvino/no_config/expected | 45 + .../openvino/no_config/expected.1 | 45 + .../openvino/no_config/expected.2 | 45 + .../openvino/no_config/expected.3 | 45 + .../openvino/partial_config/config.pbtxt | 14 + .../partial_config}/expected | 23 +- .../partial_config}/expected.1 | 23 +- .../conflicting_scheduler_ensemble/model.py | 11 +- .../ensemble_first_step/model.py | 11 +- .../ensemble_second_step/model.py | 11 +- .../python/dynamic_batching/expected | 1 + .../python/dynamic_batching/expected.1 | 1 + .../python/dynamic_batching/expected.2 | 1 + .../python/dynamic_batching/expected.3 | 1 + .../python/dynamic_batching/model.py | 15 +- .../python/dynamic_batching_no_op/model.py | 15 +- .../python/incomplete_input/model.py | 13 +- .../model_transaction_policy/config.pbtxt | 24 + .../model_transaction_policy}/expected | 38 +- .../model_transaction_policy}/expected.1 | 34 +- .../model_transaction_policy}/expected.2 | 34 +- .../model_transaction_policy}/expected.3 | 34 +- .../python/model_transaction_policy/model.py | 46 + .../config.pbtxt | 24 + 
.../expected | 45 + .../expected.1 | 37 +- .../expected.2 | 37 +- .../expected.3 | 33 +- .../model.py | 46 + .../config.pbtxt | 28 + .../model_transaction_policy_no_op}/expected | 34 +- .../expected.1 | 38 +- .../expected.2 | 38 +- .../expected.3 | 38 +- .../model_transaction_policy_no_op/model.py | 46 + .../python/optional_input/config.pbtxt | 7 + .../optional_input/expected} | 27 +- .../python/optional_input/model.py | 48 + .../bad_input_dims/expected.3 | 44 - .../bad_output_dims/expected | 44 - .../bad_output_type/expected | 44 - .../bad_output_type/expected.1 | 44 - .../bad_output_type/expected.2 | 44 - .../empty_config/expected | 1 + .../empty_config/expected.1 | 1 + .../empty_config/expected.2 | 1 + .../empty_config/expected.3 | 1 + .../1/model.savedmodel/saved_model.pb | Bin .../config.pbtxt | 0 .../expected | 4 +- .../expected.1 | 4 +- .../expected.2 | 4 +- .../expected.3 | 4 +- .../1/model.savedmodel/saved_model.pb | Bin 0 -> 1407 bytes .../hint_for_no_batch_2/config.pbtxt | 10 + .../hint_for_no_batch_2/expected | 47 + .../hint_for_no_batch_2/expected.1 | 47 + .../hint_for_no_batch_2/expected.2 | 47 + .../hint_for_no_batch_2/expected.3 | 47 + .../incomplete_input/expected | 1 + .../incomplete_input/expected.1 | 1 + .../incomplete_input/expected.2 | 1 + .../incomplete_input/expected.3 | 1 + .../incomplete_output/expected | 1 + .../incomplete_output/expected.1 | 1 + .../incomplete_output/expected.2 | 1 + .../incomplete_output/expected.3 | 1 + .../kind_model_config/expected | 1 + .../kind_model_config/expected.1 | 1 + .../kind_model_config/expected.2 | 1 + .../kind_model_config/expected.3 | 1 + .../max_batch_size_set/expected | 1 + .../max_batch_size_set/expected.1 | 1 + .../max_batch_size_set/expected.2 | 1 + .../max_batch_size_set/expected.3 | 1 + .../tensorflow_savedmodel/no_config/expected | 1 + .../no_config/expected.1 | 1 + .../no_config/expected.2 | 1 + .../no_config/expected.3 | 1 + .../reshape_config_provided/config.pbtxt | 0 .../reshape_config_provided/expected | 21 +- .../reshape_config_provided/expected.1 | 21 +- .../reshape_config_provided/expected.2 | 21 +- .../reshape_config_provided/expected.3 | 21 +- .../too_many_inputs/expected | 44 - .../too_many_inputs/expected.1 | 44 - .../too_many_inputs/expected.2 | 44 - .../too_many_inputs/expected.3 | 44 - .../unknown_input/expected.3 | 44 - .../unknown_output/expected | 44 - .../unknown_output/expected.1 | 44 - .../unknown_output/expected.2 | 44 - .../unknown_output/expected.3 | 44 - .../tensorrt/empty_config/expected | 1 + .../tensorrt/empty_config_variable/expected | 1 + .../tensorrt/incomplete_input/expected | 1 + .../tensorrt/incomplete_input/expected.1 | 1 + .../tensorrt/incomplete_input/expected.2 | 1 + .../tensorrt/incomplete_input/expected.3 | 1 + .../tensorrt/incomplete_output/expected | 1 + .../tensorrt/incomplete_output/expected.1 | 1 + .../tensorrt/incomplete_output/expected.2 | 1 + .../tensorrt/incomplete_output/expected.3 | 1 + .../tensorrt/multi_prof_max_bs/expected | 1 + .../tensorrt/no_config/expected | 1 + .../tensorrt/no_config_shape_tensor/expected | 1 + .../tensorrt/no_config_variable/expected | 1 + .../tensorrt/no_name_platform/expected | 1 + .../no_name_platform_variable/expected | 1 + .../tensorrt/reshape_config_provided/expected | 1 + .../cli_messages/cli_deprecation/expected | 1 + .../cli_messages/cli_override/expected | 1 + qa/L0_model_config/compare_status.py | 45 +- qa/L0_model_config/noautofill_test.py | 62 + .../noautofill_noconfig/expected | 1 + qa/L0_model_config/test.sh | 186 +- 
.../python_addsub/__init__.py | 123 + .../python_subadd/__init__.py | 123 + qa/L0_model_namespacing/test.py | 361 ++ qa/L0_model_namespacing/test.sh | 149 + .../addsub_repo/composing_model/1/model.py | 6 + .../addsub_repo/simple_addsub/config.pbtxt | 90 + .../subadd_repo/composing_model/1/model.py | 6 + .../subadd_repo/simple_subadd/config.pbtxt | 88 + .../addsub_repo/composing_model/1/model.py | 6 + .../addsub_repo/simple_addsub/config.pbtxt | 90 + .../subadd_repo/composing_model/1/model.py | 6 + .../subadd_repo/simple_subadd/config.pbtxt | 90 + .../addsub_repo/composing_addsub/1/model.py | 6 + .../addsub_repo/simple_ensemble/config.pbtxt | 90 + .../subadd_repo/composing_subadd/1/model.py | 6 + .../subadd_repo/simple_ensemble/config.pbtxt | 90 + .../addsub_repo/composing_addsub/1/model.py | 6 + .../addsub_repo/simple_addsub/config.pbtxt | 90 + .../subadd_repo/composing_subadd/1/model.py | 6 + .../subadd_repo/simple_subadd/config.pbtxt | 90 + qa/L0_model_queue/model_queue_test.py | 427 +- qa/L0_model_queue/test.sh | 62 +- qa/L0_model_update/instance_update_test.py | 649 +++ qa/L0_model_update/test.sh | 111 + qa/L0_multi_server/test.sh | 0 .../models/nan_inf_output/1/model.py | 12 +- qa/L0_nan_inf/nan_inf_test.py | 46 +- .../nullchar_string_client.py | 63 +- qa/L0_nullchar_string/test.sh | 14 +- qa/L0_onnx_optimization/test.sh | 5 +- .../ensemble_identity_2_float32/config.pbtxt | 0 .../models/identity_2_float32/config.pbtxt | 0 .../optional_connecting_tensor/config.pbtxt | 98 + .../models/optional_identity/1/model.py | 46 + .../models/optional_identity/config.pbtxt | 53 + .../pipeline_identity_2_float32/config.pbtxt | 0 qa/L0_optional_input/optional_input_test.py | 269 +- qa/L0_optional_input/test.sh | 8 +- qa/L0_output_name/output_name_test.py | 29 +- qa/L0_output_name/test.sh | 0 qa/L0_output_validation/lt_op_val_client.py | 18 +- qa/L0_output_validation/test.sh | 0 qa/L0_parallel_copy/parallel_copy_test.py | 81 +- .../model_repository/ensemble/config.pbtxt | 68 + .../model_repository/identity/config.pbtxt | 44 + .../model_repository/parameter/1/model.py | 77 + qa/L0_parameters/parameters_test.py | 223 + qa/L0_parameters/test.sh | 95 + .../config.pbtxt | 0 .../passive_instance_test.py | 17 +- qa/L0_passive_instance/test.sh | 0 .../perf_analyzer_profile_export_schema.json | 95 + qa/L0_perf_analyzer/test.sh | 260 +- qa/L0_perf_analyzer_capi/test.sh | 138 +- qa/L0_perf_analyzer_doc_links/mkdocs.yml | 36 + qa/L0_perf_analyzer_doc_links/test.sh | 73 + qa/L0_perf_analyzer_ground_truth/test.sh | 175 + qa/L0_perf_analyzer_report/test.sh | 17 +- qa/L0_perf_deeprecommender/run_test.sh | 17 +- qa/L0_perf_deeprecommender/test.sh | 4 +- qa/L0_perf_kaldi/create_data.sh | 2 +- qa/L0_perf_kaldi/test.sh | 0 qa/L0_perf_nomodel/run_test.sh | 31 +- qa/L0_perf_nomodel/test.sh | 6 +- qa/L0_perf_pyclients/simple_perf_client.py | 318 +- qa/L0_perf_pyclients/test.sh | 6 +- qa/L0_perf_resnet/run_test.sh | 25 +- qa/L0_perf_resnet/test.sh | 8 +- qa/L0_perf_tfs/test.sh | 153 - qa/L0_perf_ts/test.sh | 124 - qa/L0_perf_vllm/test.sh | 146 + qa/L0_pinned_memory/test.sh | 14 +- qa/L0_python_api/test.sh | 50 + .../test.sh | 35 +- qa/L0_query/query_e2e.py | 113 +- qa/L0_query/test.sh | 0 qa/L0_rate_limiter/rate_limiter_test.py | 143 +- qa/L0_rate_limiter/test.sh | 22 +- qa/L0_register/test.sh | 0 qa/L0_repoagent_checksum/identity_test.py | 68 +- .../grpc_cancellation_test.py | 141 + qa/L0_request_cancellation/scheduler_test.py | 233 + qa/L0_request_cancellation/test.sh | 183 + .../models/decoupled_cache/config.pbtxt | 49 + 
.../models/identity_cache/config.pbtxt | 46 + qa/L0_response_cache/test.sh | 239 +- qa/L0_sagemaker/sagemaker_multi_model_test.py | 233 +- qa/L0_sagemaker/sagemaker_test.py | 338 +- qa/L0_sagemaker/test.sh | 34 +- .../saved_model_shape_test.py | 306 +- qa/L0_savedmodel_shape/test.sh | 0 qa/L0_scalar_io/scalar_test.py | 71 + qa/L0_scalar_io/test.sh | 93 + qa/L0_sdk/grpc_test.cc | 1 + qa/L0_sdk/http_test.cc | 1 + qa/L0_sdk/test.sh | 6 +- qa/L0_secure_grpc/test.sh | 22 +- .../config.pbtxt | 62 + .../sequence_batcher_test.py | 2718 ++++++---- qa/L0_sequence_batcher/test.sh | 334 +- .../sequence_corrid_batcher_test.py | 140 +- qa/L0_sequence_corrid_batcher/test.sh | 4 +- qa/L0_sequence_stress/sequence_stress.py | 429 +- qa/L0_sequence_stress/test.sh | 4 +- qa/L0_server_status/server_status_test.py | 535 +- qa/L0_shared_memory/shared_memory_test.py | 165 +- qa/L0_shared_memory/test.sh | 2 +- qa/L0_simple_ensemble/ensemble_test.py | 75 +- qa/L0_simple_go_client/test.sh | 31 +- qa/L0_simple_nodejs_client/test.sh | 0 qa/L0_socket/test.sh | 116 +- qa/L0_storage_S3/infer_test.py | 174 - qa/L0_storage_S3/test.sh | 199 +- qa/L0_storage_S3_local/mock_s3_service.py | 113 + .../test.sh | 207 +- qa/L0_storage_azure/infer_test.py | 174 - qa/L0_storage_azure/test.sh | 135 +- qa/L0_storage_swiftstack/infer_test.py | 274 +- qa/L0_string_io/string_client_test.py | 159 +- qa/L0_tf_gpu_io/test.sh | 76 +- qa/L0_tf_gpu_io/tf_gpu_io_test.py | 105 + qa/L0_tf_parameters/test.sh | 150 + qa/L0_tf_parameters/tf_parameter_test.py | 81 + qa/L0_tf_tag_sigdef/test.sh | 17 +- qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py | 19 +- qa/L0_tf_unknown_rank/test.sh | 7 +- qa/L0_tf_unknown_rank/tf_unknown_rank_test.py | 35 +- .../tftrt_optimization_test.py | 40 +- qa/L0_trace/opentelemetry_unittest.py | 274 + qa/L0_trace/test.sh | 541 +- qa/L0_trace/trace-config.yaml | 45 + qa/L0_trace/trace_endpoint_test.py | 455 +- qa/L0_triton_repo_agent/test.sh | 0 qa/L0_trt_compat/test.sh | 110 + qa/L0_trt_compat/trt_compatibility_test.py | 50 + qa/L0_trt_data_dependent_shape/test.sh | 94 + .../trt_data_dependent_shape_test.py | 85 + qa/L0_trt_dla/dla_test.py | 27 +- qa/L0_trt_dla/test.sh | 0 qa/L0_trt_dynamic_shape/test.sh | 2 +- .../trt_dynamic_shape_test.py | 84 +- qa/L0_trt_error_propagation/test.sh | 82 + .../trt_error_propagation_test.py | 72 + qa/L0_trt_plugin/test.sh | 186 +- qa/L0_trt_plugin/trt_plugin_test.py | 99 +- qa/L0_trt_reformat_free/test.sh | 3 +- .../trt_reformat_free_test.py | 205 +- qa/L0_trt_shape_tensors/test.sh | 8 +- .../trt_shape_tensor_test.py | 688 ++- qa/L0_vertex_ai/test.sh | 4 +- qa/L0_vertex_ai/vertex_ai_test.py | 251 +- qa/L0_warmup/decoupled/1/model.py | 7 +- qa/L0_warmup/failing_infer/1/model.py | 9 +- qa/L0_warmup/test.sh | 85 +- qa/common/check_copyright.py | 189 +- qa/common/check_massif_log.py | 45 +- qa/common/check_valgrind_log.py | 42 +- qa/common/cuda_op_kernel.cu.cc.patch | 8 +- qa/common/gen_common.py | 160 + qa/common/gen_ensemble_model_utils.py | 653 ++- qa/common/gen_jetson_trt_models | 188 + qa/common/gen_qa_custom_ops | 45 +- qa/common/gen_qa_custom_ops_models.py | 271 +- .../gen_qa_dyna_sequence_implicit_models.py | 542 +- qa/common/gen_qa_dyna_sequence_models.py | 1020 ++-- qa/common/gen_qa_identity_models.py | 1179 +++-- qa/common/gen_qa_implicit_models.py | 1290 +++-- qa/common/gen_qa_model_repository | 156 +- qa/common/gen_qa_models.py | 2824 ++++++---- qa/common/gen_qa_noshape_models.py | 513 +- qa/common/gen_qa_ort_scalar_models.py | 130 + qa/common/gen_qa_pytorch_model.py | 124 + 
qa/common/gen_qa_ragged_models.py | 706 +-- qa/common/gen_qa_reshape_models.py | 1596 +++--- qa/common/gen_qa_sequence_models.py | 1000 ++-- qa/common/gen_qa_tf_parameters.py | 122 + qa/common/gen_qa_torchtrt_models.py | 34 +- qa/common/gen_qa_trt_data_dependent_shape.py | 158 + qa/common/gen_qa_trt_format_models.py | 402 +- qa/common/gen_qa_trt_plugin_models.py | 366 +- qa/common/gen_tag_sigdef.py | 255 +- qa/common/gen_xavier_trt_models | 118 - qa/common/infer_test.py | 220 + qa/common/infer_util.py | 926 ++-- .../non_aligned_validation_batched.json | 56 +- .../non_aligned_validation_no_batch.json | 56 +- .../simple_model.py | 106 +- .../validation_batched.json | 64 +- .../validation_no_batch.json | 64 +- .../wrong_validation_batched.json | 64 +- .../wrong_validation_no_batch.json | 64 +- qa/common/libtorch_infer_client.py | 45 +- qa/common/nightly_email_helper.py | 41 +- .../int_data.json | 4 +- .../int_data_diff_shape.json | 4 +- .../int_data_optional.json | 14 + .../perf_analyzer_input_data_json/output.json | 2 +- .../repeat_int32_data.json | 31 + .../string_data_with_shape.json | 8 +- .../wrong_output.json | 2 +- .../wrong_output_2.json | 2 +- qa/common/reporter.py | 203 +- qa/common/sequence_util.py | 836 +-- qa/common/shm_util.py | 330 +- qa/common/test_util.py | 201 +- qa/common/trace_summary.py | 352 +- qa/common/util.sh | 124 +- .../custom_zero_1_float32/config.pbtxt | 0 qa/openvino_models/README.md | 34 + qa/openvino_models/dynamic_batch/1/model.bin | 0 .../dynamic_batch/1/model.mapping | 195 + qa/openvino_models/dynamic_batch/1/model.xml | 166 + qa/openvino_models/fixed_batch/1/model.bin | 0 .../fixed_batch/1/model.mapping | 211 + qa/openvino_models/fixed_batch/1/model.xml | 152 + qa/python_models/add_sub/config.pbtxt | 1 - qa/python_models/add_sub/model.py | 50 +- qa/python_models/add_sub_gpu/config.pbtxt | 8 +- qa/python_models/auto_complete/model.py | 58 +- qa/python_models/auto_complete_error/model.py | 15 +- qa/python_models/bls/model.py | 712 ++- qa/python_models/bls_async/model.py | 172 +- .../bls_finalize_error/config.pbtxt | 38 + qa/python_models/bls_finalize_error/model.py | 45 + qa/python_models/bls_init_error/config.pbtxt | 38 + qa/python_models/bls_init_error/model.py | 44 + qa/python_models/bls_memory/model.py | 68 +- qa/python_models/bls_memory_async/model.py | 48 +- .../bls_model_loading/config.pbtxt | 43 + qa/python_models/bls_model_loading/model.py | 135 + qa/python_models/bls_onnx_warmup/config.pbtxt | 38 + qa/python_models/bls_onnx_warmup/model.py | 88 + qa/python_models/bls_parameters/config.pbtxt | 52 + qa/python_models/bls_parameters/model.py | 77 + .../bls_request_rescheduling/config.pbtxt | 38 + .../bls_request_rescheduling/model.py | 133 + qa/python_models/bls_simple/bls_simple.py | 84 + qa/python_models/bls_undefined/config.pbtxt | 50 + qa/python_models/bls_undefined/model.py | 33 + .../cuda_memory_consumer/1/model.py | 69 + .../cuda_memory_consumer/config.pbtxt | 28 + qa/python_models/custom_metrics/config.pbtxt | 43 + qa/python_models/custom_metrics/model.py | 278 + qa/python_models/delayed_model/model.py | 10 +- qa/python_models/dlpack_add_sub/model.py | 101 +- .../dlpack_empty_output/config.pbtxt | 43 + qa/python_models/dlpack_empty_output/model.py | 53 + qa/python_models/dlpack_identity/model.py | 8 +- qa/python_models/dlpack_io_identity/model.py | 53 +- .../dlpack_io_identity_decoupled/model.py | 43 +- qa/python_models/dlpack_square/config.pbtxt | 48 + qa/python_models/dlpack_square/model.py | 139 + qa/python_models/dlpack_sub_add/model.py | 
101 +- qa/python_models/dlpack_test/model.py | 312 +- qa/python_models/error_code/config.pbtxt | 47 + qa/python_models/error_code/model.py | 59 + qa/python_models/execute_cancel/config.pbtxt | 47 + qa/python_models/execute_cancel/model.py | 108 + qa/python_models/execute_error/model.py | 19 +- .../execute_return_error/model.py | 5 +- qa/python_models/fini_error/model.py | 3 +- qa/python_models/ground_truth/config.pbtxt | 52 + qa/python_models/ground_truth/model.py | 51 + qa/python_models/identity_fp32/model.py | 3 +- .../identity_fp32_logging/config.pbtxt | 53 + .../identity_fp32_logging/model.py | 72 + .../identity_fp32_timeout/config.pbtxt | 60 + .../identity_fp32_timeout/model.py | 45 + qa/python_models/init_args/model.py | 46 +- qa/python_models/init_error/model.py | 5 +- qa/python_models/init_exit/config.pbtxt | 46 + qa/python_models/init_exit/model.py | 40 + .../iterative_sequence/config.pbtxt | 51 + qa/python_models/iterative_sequence/model.py | 131 + qa/python_models/model_env/model.py | 9 +- qa/python_models/model_init_del/config.pbtxt | 52 + qa/python_models/model_init_del/model.py | 57 + qa/python_models/model_init_del/util.py | 189 + qa/python_models/multi_file/file1.py | 6 +- qa/python_models/multi_file/file2.py | 6 +- qa/python_models/multi_file/model.py | 11 +- qa/python_models/non_contiguous/model.py | 11 +- qa/python_models/optional/config.pbtxt | 7 - qa/python_models/optional/model.py | 14 +- .../add_sub_backend/model.py | 162 + qa/python_models/python_version/model.py | 27 +- qa/python_models/pytorch_fp32_fp32/model.py | 6 +- .../request_rescheduling_addsub/config.pbtxt | 61 + .../request_rescheduling_addsub/model.py | 82 + .../response_sender_error/model.py | 37 +- qa/python_models/sequence_int32/config.pbtxt | 80 + qa/python_models/sequence_int32/model.py | 92 + .../python_models/sequence_py/config.pbtxt | 48 +- qa/python_models/sequence_py/model.py | 93 + qa/python_models/string/model.py | 8 +- qa/python_models/string_fixed/model.py | 28 +- qa/python_models/string_identity/model.py | 14 +- qa/python_models/sub_add/model.py | 54 +- .../torchvision/resnet50/config.pbtxt | 40 + .../torchvision/resnet50/model.py | 62 + .../variable_gpu_output/config.pbtxt | 55 + qa/python_models/variable_gpu_output/model.py | 46 + qa/python_models/wrong_model/model.py | 3 +- .../wrong_return_type/config.pbtxt | 49 + qa/python_models/wrong_return_type/model.py | 67 + src/CMakeLists.txt | 188 +- src/classification.cc | 1 + src/classification.h | 1 + src/command_line_parser.cc | 2244 ++++++++ src/command_line_parser.h | 345 ++ src/common.cc | 12 +- src/common.h | 55 +- src/data_compressor.h | 13 +- src/grpc/CMakeLists.txt | 144 + src/grpc/grpc_handler.h | 46 + src/grpc/grpc_server.cc | 2552 +++++++++ src/grpc/grpc_server.h | 139 + src/grpc/grpc_utils.cc | 160 + src/grpc/grpc_utils.h | 187 + src/grpc/infer_handler.cc | 1068 ++++ src/grpc/infer_handler.h | 1436 +++++ src/grpc/stream_infer_handler.cc | 732 +++ src/grpc/stream_infer_handler.h | 124 + src/grpc_server.cc | 4621 ----------------- src/grpc_server.h | 132 - src/http_server.cc | 2723 +++++++--- src/http_server.h | 297 +- src/main.cc | 1646 +----- src/memory_alloc.cc | 2 + src/multi_server.cc | 2 + src/restricted_features.h | 114 + src/sagemaker_server.cc | 637 ++- src/sagemaker_server.h | 55 +- src/shared_memory_manager.cc | 10 +- src/shared_memory_manager.h | 1 + src/simple.cc | 44 +- src/test/CMakeLists.txt | 8 +- src/test/caffe2plan.cc | 7 +- src/test/data_compressor_test.cc | 6 +- .../src/distributed_addsub.cc | 11 +- 
src/test/dyna_sequence/src/dyna_sequence.cc | 1 + src/test/implicit_state/src/implicit_state.cc | 204 +- src/test/iterative_sequence/CMakeLists.txt | 118 + ...tonIterativeSequenceBackendConfig.cmake.in | 39 + .../src/iterative_sequence.cc | 582 +++ .../src/libtriton_iterative_sequence.ldscript | 22 +- src/test/query_backend/src/query.cc | 5 +- .../relocation_repoagent/src/relocation.cc | 8 +- src/test/sequence/src/sequence.cc | 1 + src/tracer.cc | 778 ++- src/tracer.h | 214 +- src/triton_signal.h | 1 + src/vertex_ai_server.cc | 6 +- src/vertex_ai_server.h | 2 +- 883 files changed, 76704 insertions(+), 32486 deletions(-) create mode 100644 .github/workflows/codeql.yml create mode 100644 .github/workflows/pre-commit.yaml create mode 100644 .pre-commit-config.yaml create mode 100644 CITATION.cff create mode 100644 SECURITY.md mode change 100644 => 100755 compose.py mode change 100644 => 100755 deploy/gke-marketplace-app/benchmark/perf-analyzer-script/perf_query.sh rename deploy/gke-marketplace-app/client-sample/{locustfile_bert_large.py => locustfile_bert.py} (87%) mode change 100644 => 100755 mode change 100644 => 100755 deploy/gke-marketplace-app/client-sample/perf_analyzer_grpc.sh mode change 100644 => 100755 deploy/gke-marketplace-app/server-deployer/build_and_push.sh create mode 100644 deploy/gke-marketplace-app/server-deployer/chart/triton/templates/ingress.yaml create mode 100644 deploy/gke-marketplace-app/trt-engine/README.md mode change 100755 => 100644 deploy/mlflow-triton-plugin/examples/onnx_float32_int32_int32/config.pbtxt mode change 100644 => 100755 deploy/mlflow-triton-plugin/mlflow_triton/__init__.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/mlflow_triton/config.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/mlflow_triton/deployments.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/scripts/publish_model_to_mlflow.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/scripts/triton_flavor.py mode change 100644 => 100755 deploy/mlflow-triton-plugin/setup.py mode change 100644 => 100755 docker/cpu_only/entrypoint.d/12-banner.sh mode change 100644 => 100755 docker/cpu_only/entrypoint.d/50-gpu-driver-check2.sh mode change 100644 => 100755 docker/entrypoint.d/50-gpu-driver-check2.sh mode change 100644 => 100755 docker/entrypoint.d/99-check-run-aip-mode.sh create mode 100644 docs/.gitignore create mode 100644 docs/Dockerfile.docs create mode 100644 docs/Makefile create mode 100644 docs/_static/.gitattributes create mode 100644 docs/_static/custom.css create mode 100644 docs/_static/logo_2color_horizontal.svg create mode 100644 docs/_static/logo_2color_vertical.svg create mode 100644 docs/_static/nvidia-logo-horiz-rgb-blk-for-screen.png create mode 100644 docs/_static/nvidia-logo-vert-rgb-blk-for-screen.png create mode 100644 docs/_static/rtd-data.js create mode 100644 docs/_templates/layout.html create mode 100755 docs/conf.py create mode 100644 docs/contents.md rename docs/{ => customization_guide}/build.md (90%) rename docs/{ => customization_guide}/compose.md (89%) create mode 100644 docs/customization_guide/deploy.md rename docs/{ => customization_guide}/inference_protocols.md (65%) rename docs/{ => customization_guide}/repository_agents.md (98%) rename docs/{ => customization_guide}/test.md (96%) create mode 100644 docs/examples/README.md mode change 100644 => 100755 docs/examples/jetson/concurrency_and_dynamic_batching/tao/convert_peoplenet.sh mode change 100755 => 100644 docs/examples/model_repository/simple_identity/config.pbtxt 
rename docs/{ => getting_started}/quickstart.md (84%) create mode 100644 docs/index.md delete mode 100644 docs/metrics.md delete mode 100644 docs/perf_analyzer.md create mode 100644 docs/protocol/extension_generate.md create mode 100644 docs/protocol/extension_logging.md create mode 100644 docs/protocol/extension_parameters.md delete mode 100644 docs/response_cache.md delete mode 100644 docs/trace.md rename docs/{ => user_guide}/architecture.md (96%) rename docs/{ => user_guide}/custom_operations.md (81%) create mode 100644 docs/user_guide/debugging_guide.md rename docs/{ => user_guide}/decoupled_models.md (73%) rename docs/{ => user_guide}/faq.md (68%) rename docs/{ => user_guide}/images/arch.jpg (100%) rename docs/{ => user_guide}/images/dyna_sequence_example0.png (100%) rename docs/{ => user_guide}/images/dyna_sequence_example1.png (100%) rename docs/{ => user_guide}/images/ensemble_example0.png (100%) rename docs/{ => user_guide}/images/multi_model_exec.png (100%) rename docs/{ => user_guide}/images/multi_model_parallel_exec.png (100%) rename docs/{ => user_guide}/images/multi_model_serial_exec.png (100%) rename docs/{ => user_guide}/images/sequence_example0.png (100%) rename docs/{ => user_guide}/images/sequence_example1.png (100%) rename docs/{ => user_guide}/images/sequence_example2.png (100%) rename docs/{ => user_guide}/images/triton_on_jetson.png (100%) rename docs/{ => user_guide}/jetson.md (78%) create mode 100644 docs/user_guide/metrics.md create mode 100644 docs/user_guide/model_analyzer.md rename docs/{ => user_guide}/model_configuration.md (76%) rename docs/{ => user_guide}/model_management.md (66%) rename docs/{ => user_guide}/model_repository.md (67%) rename docs/{ => user_guide}/optimization.md (85%) create mode 100644 docs/user_guide/perf_analyzer.md create mode 100644 docs/user_guide/performance_tuning.md rename docs/{ => user_guide}/ragged_batching.md (97%) rename docs/{ => user_guide}/rate_limiter.md (98%) create mode 100644 docs/user_guide/request_cancellation.md create mode 100644 docs/user_guide/response_cache.md create mode 100644 docs/user_guide/trace.md rename docs/{ => user_guide}/v1_to_v2.md (95%) create mode 100644 pyproject.toml mode change 100644 => 100755 qa/L0_async_work_queue/test.sh mode change 100644 => 100755 qa/L0_backend_config/test.sh create mode 100755 qa/L0_backend_fastertransformer/test.sh mode change 100644 => 100755 qa/L0_backend_identity/identity_test.py create mode 100755 qa/L0_backend_output_detail/test.sh mode change 100644 => 100755 qa/L0_backend_python/argument_validation/test.sh create mode 100755 qa/L0_backend_python/bls/bls_parameters_test.py mode change 100644 => 100755 qa/L0_backend_python/bls/test.sh mode change 100644 => 100755 qa/L0_backend_python/common.sh create mode 100755 qa/L0_backend_python/custom_metrics/test.sh mode change 100644 => 100755 qa/L0_backend_python/decoupled/decoupled_test.py create mode 100644 qa/L0_backend_python/decoupled/models/decoupled_bls_stream/1/model.py create mode 100644 qa/L0_backend_python/decoupled/models/decoupled_bls_stream/config.pbtxt create mode 100644 qa/L0_backend_python/decoupled/models/decoupled_raise_exception/1/model.py create mode 100644 qa/L0_backend_python/decoupled/models/decoupled_raise_exception/config.pbtxt mode change 100644 => 100755 qa/L0_backend_python/decoupled/test.sh mode change 100644 => 100755 qa/L0_backend_python/ensemble/ensemble_test.py mode change 100644 => 100755 qa/L0_backend_python/ensemble/test.sh mode change 100644 => 100755 qa/L0_backend_python/env/test.sh 
mode change 100644 => 100755 qa/L0_backend_python/examples/test.sh mode change 100644 => 100755 qa/L0_backend_python/io/io_test.py mode change 100644 => 100755 qa/L0_backend_python/io/test.sh mode change 100644 => 100755 qa/L0_backend_python/lifecycle/lifecycle_test.py mode change 100644 => 100755 qa/L0_backend_python/lifecycle/test.sh create mode 100755 qa/L0_backend_python/logging/logging_test.py create mode 100755 qa/L0_backend_python/logging/test.sh mode change 100644 => 100755 qa/L0_backend_python/model_control/model_control_test.py mode change 100644 => 100755 qa/L0_backend_python/model_control/test.sh create mode 100644 qa/L0_backend_python/python_based_backends/python_based_backends_test.py create mode 100755 qa/L0_backend_python/python_based_backends/test.sh mode change 100644 => 100755 qa/L0_backend_python/python_test.py mode change 100644 => 100755 qa/L0_backend_python/python_unittest.py create mode 100755 qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py create mode 100755 qa/L0_backend_python/request_rescheduling/test.sh mode change 100644 => 100755 qa/L0_backend_python/restart/restart_test.py mode change 100644 => 100755 qa/L0_backend_python/restart/test.sh create mode 100755 qa/L0_backend_python/setup_python_enviroment.sh mode change 100644 => 100755 qa/L0_backend_python/variants/test.sh create mode 100755 qa/L0_batch_custom/batch_custom_test.py create mode 100755 qa/L0_batch_custom/test.sh mode change 100644 => 100755 qa/L0_batch_input/batch_input_test.py mode change 100644 => 100755 qa/L0_batch_input/test.sh mode change 100644 => 100755 qa/L0_batcher/batcher_test.py mode change 100644 => 100755 qa/L0_batcher/verify_timestamps.py mode change 100644 => 100755 qa/L0_buffer_attributes/buffer_attributes_test.py mode change 100644 => 100755 qa/L0_buffer_attributes/test.sh mode change 100644 => 100755 qa/L0_client_java/test.sh mode change 100644 => 100755 qa/L0_client_memory_growth/client_memory_mail.py mode change 100644 => 100755 qa/L0_client_nobatch/client_test.py rename qa/L0_client_timeout/{client_timeout_test.py => client_infer_timeout_test.py} (61%) mode change 100644 => 100755 create mode 100755 qa/L0_client_timeout/client_non_infer_timeout_test.py mode change 100644 => 100755 qa/L0_client_timeout/test.sh create mode 100755 qa/L0_cmdline_trace/trace_client.py create mode 100644 qa/L0_config_json/max_priority_level.pbtxt mode change 100644 => 100755 qa/L0_cuda_graph/test.sh mode change 100644 => 100755 qa/L0_cuda_graph/trt_cuda_graph_test.py mode change 100644 => 100755 qa/L0_cuda_shared_memory/cuda_shared_memory_test.py mode change 100644 => 100755 qa/L0_cuda_shared_memory/test.sh mode change 100644 => 100755 qa/L0_custom_ops/cuda_op_test.py mode change 100644 => 100755 qa/L0_custom_ops/mod_op_test.py mode change 100644 => 100755 qa/L0_custom_ops/onnx_op_test.py mode change 100644 => 100755 qa/L0_custom_ops/vision_op_test.py mode change 100644 => 100755 qa/L0_custom_ops/zero_out_test.py mode change 100644 => 100755 qa/L0_data_compression/test.sh mode change 100644 => 100755 qa/L0_data_compression/validation.py mode change 100644 => 100755 qa/L0_decoupled/decoupled_test.py mode change 100644 => 100755 qa/L0_decoupled/test.sh create mode 100755 qa/L0_device_memory_tracker/test.py create mode 100755 qa/L0_device_memory_tracker/test.sh rename qa/{L0_backend_python/unittest => L0_dlpack_multi_gpu}/test.sh (79%) mode change 100644 => 100755 create mode 100644 qa/L0_doc_links/mkdocs.yml create mode 100755 qa/L0_doc_links/test.sh mode change 100644 => 100755 
qa/L0_dyna_implicit_state/test.sh mode change 100644 => 100755 qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py create mode 100644 qa/L0_grpc/client_plugin_models/client_plugin_test/1/model.py rename docs/model_analyzer.md => qa/L0_grpc/client_plugin_models/client_plugin_test/config.pbtxt (63%) create mode 100755 qa/L0_grpc/grpc_basic_auth_test.py create mode 100755 qa/L0_grpc/grpc_client_plugin_test.py create mode 100644 qa/L0_grpc/nginx.conf create mode 100755 qa/L0_grpc/python_grpc_aio_test.py create mode 100755 qa/L0_grpc/python_unit_test.py mode change 100644 => 100755 qa/L0_grpc/test.sh create mode 100755 qa/L0_grpc_state_cleanup/cleanup_test.py create mode 100755 qa/L0_grpc_state_cleanup/test.sh create mode 100755 qa/L0_http/generate_endpoint_test.py create mode 100644 qa/L0_http/generate_models/mock_llm/1/model.py create mode 100644 qa/L0_http/generate_models/mock_llm/config.pbtxt create mode 100755 qa/L0_http/http_basic_auth_test.py create mode 100755 qa/L0_http/http_client_plugin_test.py create mode 100755 qa/L0_http/http_restricted_api_test.py mode change 100644 => 100755 qa/L0_http/http_test.py create mode 100644 qa/L0_http/nginx.conf create mode 100755 qa/L0_http/python_http_aio_test.py mode change 100644 => 100755 qa/L0_http/test.sh mode change 100644 => 100755 qa/L0_http_fuzz/fuzztest.py mode change 100644 => 100755 qa/L0_http_fuzz/test.sh mode change 100644 => 100755 qa/L0_https/test.sh mode change 100644 => 100755 qa/L0_implicit_state/implicit_state.py create mode 100644 qa/L0_implicit_state/models/growable_memory/config.pbtxt create mode 100644 qa/L0_implicit_state/models/single_state_buffer/config.pbtxt mode change 100644 => 100755 qa/L0_implicit_state/test.sh mode change 100644 => 100755 qa/L0_infer/infer_test.py mode change 100644 => 100755 qa/L0_infer_reshape/infer_reshape_test.py mode change 100644 => 100755 qa/L0_infer_variable/infer_variable_test.py mode change 100644 => 100755 qa/L0_infer_zero/infer_zero_test.py mode change 100644 => 100755 qa/L0_inferentia_perf_analyzer/test.sh create mode 100755 qa/L0_iterative_sequence/iterative_sequence_e2e.py create mode 100644 qa/L0_iterative_sequence/models/iterative_sequence/config.pbtxt create mode 100755 qa/L0_iterative_sequence/test.sh create mode 100755 qa/L0_json/test.sh mode change 100644 => 100755 qa/L0_large_payload/large_payload_test.py mode change 100644 => 100755 qa/L0_large_payload/test.sh mode change 100644 => 100755 qa/L0_libtorch_inference_mode/test.sh create mode 100755 qa/L0_libtorch_instance_group_kind_model/client.py create mode 100755 qa/L0_libtorch_instance_group_kind_model/gen_models.py create mode 100644 qa/L0_libtorch_instance_group_kind_model/models/libtorch_multi_device/config.pbtxt create mode 100755 qa/L0_libtorch_instance_group_kind_model/test.sh mode change 100644 => 100755 qa/L0_libtorch_io_names/io_names_client.py mode change 100644 => 100755 qa/L0_libtorch_io_names/test.sh create mode 100755 qa/L0_libtorch_io_types/test.sh mode change 100644 => 100755 qa/L0_libtorch_nvfuser/test.sh mode change 100644 => 100755 qa/L0_libtorch_optimized_execution/test.sh mode change 100644 => 100755 qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py mode change 100644 => 100755 qa/L0_libtorch_shared_weights/test.sh mode change 100644 => 100755 qa/L0_lifecycle/lifecycle_test.py create mode 100755 qa/L0_logging/logging_endpoint_test.py create mode 100755 qa/L0_logging/test.sh mode change 100644 => 100755 qa/L0_long_running_stress/crashing_client.py mode change 100644 => 100755 
qa/L0_long_running_stress/scenarios.py mode change 100644 => 100755 qa/L0_long_running_stress/stress.py mode change 100644 => 100755 qa/L0_long_running_stress/stress_mail.py mode change 100644 => 100755 qa/L0_memory/test.sh mode change 100644 => 100755 qa/L0_memory_growth/busy_op_test.py mode change 100644 => 100755 qa/L0_memory_growth/server_memory_mail.py create mode 100644 qa/L0_metrics/ensemble_delay/config.pbtxt create mode 100644 qa/L0_metrics/identity_delay/config.pbtxt create mode 100755 qa/L0_metrics/metrics_config_test.py create mode 100755 qa/L0_metrics/metrics_queue_size_test.py create mode 100644 qa/L0_metrics/unit_test_models/identity_cache_off/config.pbtxt create mode 100644 qa/L0_metrics/unit_test_models/identity_cache_on/config.pbtxt mode change 100644 => 100755 qa/L0_mlflow/plugin_test.py mode change 100644 => 100755 qa/L0_mlflow/test.sh create mode 100644 qa/L0_model_config/autofill_noplatform/custom/no_delimiter/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/custom/no_delimiter/expected create mode 100644 qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/expected create mode 100644 qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/unknown_input/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/unknown_input/expected create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/unknown_output/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/openvino/unknown_output/expected create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/expected create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/model.py create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/expected create mode 100644 qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/model.py rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_input_dims/config.pbtxt (100%) create mode 100644 
qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/expected rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_input_type/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/expected rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_output_dims/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/expected rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/bad_output_type/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/expected rename qa/L0_model_config/{autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch => autofill_noplatform/tensorflow_savedmodel/too_many_inputs}/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/too_many_inputs/config.pbtxt (93%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/expected rename qa/L0_model_config/{autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs => autofill_noplatform/tensorflow_savedmodel/unknown_input}/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/unknown_input/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/expected rename qa/L0_model_config/{autofill_noplatform_success/tensorflow_savedmodel/unknown_input => autofill_noplatform/tensorflow_savedmodel/unknown_output}/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/{autofill_noplatform_success => autofill_noplatform}/tensorflow_savedmodel/unknown_output/config.pbtxt (100%) create mode 100644 qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/expected mode change 100755 => 100644 qa/L0_model_config/autofill_noplatform_success/onnx/cpu_instance/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.1 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.2 create mode 100644 
qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.3 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.1 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.2 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.3 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.1 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.2 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.3 create mode 100644 qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/config.pbtxt rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/unknown_input => openvino/partial_config}/expected (62%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/unknown_input => openvino/partial_config}/expected.1 (62%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/config.pbtxt rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_dims => python/model_transaction_policy}/expected (51%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_type => python/model_transaction_policy}/expected.1 (51%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_type => python/model_transaction_policy}/expected.2 (51%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_type => python/model_transaction_policy}/expected.3 (51%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/model.py create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_output_dims => python/model_transaction_policy_decoupled_false}/expected.1 (50%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_dims => python/model_transaction_policy_decoupled_false}/expected.2 (50%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_output_type => python/model_transaction_policy_decoupled_false}/expected.3 (50%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/model.py create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/config.pbtxt rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_type => python/model_transaction_policy_no_op}/expected (50%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_input_dims => python/model_transaction_policy_no_op}/expected.1 (50%) rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_output_dims => python/model_transaction_policy_no_op}/expected.2 (50%) 
rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/bad_output_dims => python/model_transaction_policy_no_op}/expected.3 (50%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/model.py create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/optional_input/config.pbtxt rename qa/L0_model_config/autofill_noplatform_success/{tensorflow_savedmodel/unknown_input/expected.2 => python/optional_input/expected} (52%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/python/optional_input/model.py delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.3 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.1 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.2 rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{unknown_output => hint_for_no_batch_1}/1/model.savedmodel/saved_model.pb (100%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/config.pbtxt (100%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/expected (91%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/expected.1 (91%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/expected.2 (91%) rename qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/{hint_for_no_batch => hint_for_no_batch_1}/expected.3 (91%) create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/1/model.savedmodel/saved_model.pb create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/config.pbtxt create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/expected create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/expected.1 create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/expected.2 create mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/expected.3 mode change 100755 => 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/reshape_config_provided/config.pbtxt delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/expected delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/expected.1 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/expected.2 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/expected.3 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.3 delete mode 100644 
qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/expected delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/expected.1 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/expected.2 delete mode 100644 qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/expected.3 create mode 100644 qa/L0_model_config/cli_messages/cli_deprecation/expected create mode 100644 qa/L0_model_config/cli_messages/cli_override/expected mode change 100644 => 100755 qa/L0_model_config/compare_status.py create mode 100755 qa/L0_model_config/noautofill_test.py create mode 100644 qa/L0_model_config/special_cases/noautofill_noconfig/expected create mode 100755 qa/L0_model_namespacing/python_addsub/__init__.py create mode 100755 qa/L0_model_namespacing/python_subadd/__init__.py create mode 100755 qa/L0_model_namespacing/test.py create mode 100755 qa/L0_model_namespacing/test.sh create mode 100644 qa/L0_model_namespacing/test_duplication/addsub_repo/composing_model/1/model.py create mode 100644 qa/L0_model_namespacing/test_duplication/addsub_repo/simple_addsub/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_duplication/subadd_repo/composing_model/1/model.py create mode 100644 qa/L0_model_namespacing/test_duplication/subadd_repo/simple_subadd/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/composing_model/1/model.py create mode 100644 qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/simple_addsub/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/composing_model/1/model.py create mode 100644 qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/simple_subadd/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/composing_addsub/1/model.py create mode 100644 qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/simple_ensemble/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/composing_subadd/1/model.py create mode 100644 qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/simple_ensemble/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_no_duplication/addsub_repo/composing_addsub/1/model.py create mode 100644 qa/L0_model_namespacing/test_no_duplication/addsub_repo/simple_addsub/config.pbtxt create mode 100644 qa/L0_model_namespacing/test_no_duplication/subadd_repo/composing_subadd/1/model.py create mode 100644 qa/L0_model_namespacing/test_no_duplication/subadd_repo/simple_subadd/config.pbtxt mode change 100644 => 100755 qa/L0_model_queue/model_queue_test.py mode change 100644 => 100755 qa/L0_model_queue/test.sh create mode 100755 qa/L0_model_update/instance_update_test.py create mode 100755 qa/L0_model_update/test.sh mode change 100644 => 100755 qa/L0_multi_server/test.sh mode change 100644 => 100755 qa/L0_nan_inf/nan_inf_test.py mode change 100644 => 100755 qa/L0_nullchar_string/nullchar_string_client.py mode change 100644 => 100755 qa/L0_nullchar_string/test.sh mode change 100755 => 100644 qa/L0_optional_input/models/ensemble_identity_2_float32/config.pbtxt mode change 100755 => 100644 qa/L0_optional_input/models/identity_2_float32/config.pbtxt create mode 100644 qa/L0_optional_input/models/optional_connecting_tensor/config.pbtxt create mode 100644 qa/L0_optional_input/models/optional_identity/1/model.py create mode 100644 
qa/L0_optional_input/models/optional_identity/config.pbtxt mode change 100755 => 100644 qa/L0_optional_input/models/pipeline_identity_2_float32/config.pbtxt mode change 100644 => 100755 qa/L0_optional_input/optional_input_test.py mode change 100644 => 100755 qa/L0_output_name/output_name_test.py mode change 100644 => 100755 qa/L0_output_name/test.sh mode change 100644 => 100755 qa/L0_output_validation/lt_op_val_client.py mode change 100644 => 100755 qa/L0_output_validation/test.sh mode change 100644 => 100755 qa/L0_parallel_copy/parallel_copy_test.py create mode 100644 qa/L0_parameters/model_repository/ensemble/config.pbtxt create mode 100644 qa/L0_parameters/model_repository/identity/config.pbtxt create mode 100644 qa/L0_parameters/model_repository/parameter/1/model.py create mode 100755 qa/L0_parameters/parameters_test.py create mode 100755 qa/L0_parameters/test.sh mode change 100755 => 100644 qa/L0_passive_instance/models/distributed_int32_int32_int32/config.pbtxt mode change 100644 => 100755 qa/L0_passive_instance/passive_instance_test.py mode change 100644 => 100755 qa/L0_passive_instance/test.sh create mode 100644 qa/L0_perf_analyzer/perf_analyzer_profile_export_schema.json create mode 100644 qa/L0_perf_analyzer_doc_links/mkdocs.yml create mode 100755 qa/L0_perf_analyzer_doc_links/test.sh create mode 100755 qa/L0_perf_analyzer_ground_truth/test.sh mode change 100644 => 100755 qa/L0_perf_kaldi/create_data.sh mode change 100644 => 100755 qa/L0_perf_kaldi/test.sh mode change 100644 => 100755 qa/L0_perf_pyclients/simple_perf_client.py delete mode 100755 qa/L0_perf_tfs/test.sh delete mode 100755 qa/L0_perf_ts/test.sh create mode 100755 qa/L0_perf_vllm/test.sh create mode 100755 qa/L0_python_api/test.sh rename qa/{L0_jetson_example => L0_python_client_unit_tests}/test.sh (57%) mode change 100644 => 100755 mode change 100644 => 100755 qa/L0_query/query_e2e.py mode change 100644 => 100755 qa/L0_query/test.sh mode change 100644 => 100755 qa/L0_rate_limiter/rate_limiter_test.py mode change 100644 => 100755 qa/L0_rate_limiter/test.sh mode change 100644 => 100755 qa/L0_register/test.sh mode change 100644 => 100755 qa/L0_repoagent_checksum/identity_test.py create mode 100755 qa/L0_request_cancellation/grpc_cancellation_test.py create mode 100755 qa/L0_request_cancellation/scheduler_test.py create mode 100755 qa/L0_request_cancellation/test.sh create mode 100644 qa/L0_response_cache/models/decoupled_cache/config.pbtxt create mode 100644 qa/L0_response_cache/models/identity_cache/config.pbtxt mode change 100644 => 100755 qa/L0_sagemaker/sagemaker_multi_model_test.py mode change 100644 => 100755 qa/L0_sagemaker/sagemaker_test.py mode change 100644 => 100755 qa/L0_savedmodel_shape/saved_model_shape_test.py mode change 100644 => 100755 qa/L0_savedmodel_shape/test.sh create mode 100755 qa/L0_scalar_io/scalar_test.py create mode 100755 qa/L0_scalar_io/test.sh mode change 100644 => 100755 qa/L0_secure_grpc/test.sh create mode 100644 qa/L0_sequence_batcher/request_timeout_models/custom_sequence_int32_timeout/config.pbtxt mode change 100644 => 100755 qa/L0_sequence_batcher/sequence_batcher_test.py mode change 100644 => 100755 qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py mode change 100644 => 100755 qa/L0_sequence_stress/sequence_stress.py mode change 100644 => 100755 qa/L0_server_status/server_status_test.py mode change 100644 => 100755 qa/L0_shared_memory/shared_memory_test.py mode change 100644 => 100755 qa/L0_shared_memory/test.sh mode change 100644 => 100755 
qa/L0_simple_ensemble/ensemble_test.py mode change 100644 => 100755 qa/L0_simple_go_client/test.sh mode change 100644 => 100755 qa/L0_simple_nodejs_client/test.sh mode change 100644 => 100755 qa/L0_socket/test.sh delete mode 100644 qa/L0_storage_S3/infer_test.py create mode 100755 qa/L0_storage_S3_local/mock_s3_service.py rename qa/{L0_s3_local => L0_storage_S3_local}/test.sh (64%) mode change 100644 => 100755 delete mode 100644 qa/L0_storage_azure/infer_test.py mode change 100644 => 100755 qa/L0_storage_swiftstack/infer_test.py mode change 100644 => 100755 qa/L0_string_io/string_client_test.py create mode 100755 qa/L0_tf_gpu_io/tf_gpu_io_test.py create mode 100755 qa/L0_tf_parameters/test.sh create mode 100755 qa/L0_tf_parameters/tf_parameter_test.py mode change 100644 => 100755 qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py mode change 100644 => 100755 qa/L0_tf_unknown_rank/test.sh mode change 100644 => 100755 qa/L0_tf_unknown_rank/tf_unknown_rank_test.py mode change 100644 => 100755 qa/L0_tftrt_optimization/tftrt_optimization_test.py create mode 100644 qa/L0_trace/opentelemetry_unittest.py create mode 100644 qa/L0_trace/trace-config.yaml mode change 100644 => 100755 qa/L0_trace/trace_endpoint_test.py mode change 100644 => 100755 qa/L0_triton_repo_agent/test.sh create mode 100755 qa/L0_trt_compat/test.sh create mode 100755 qa/L0_trt_compat/trt_compatibility_test.py create mode 100755 qa/L0_trt_data_dependent_shape/test.sh create mode 100755 qa/L0_trt_data_dependent_shape/trt_data_dependent_shape_test.py mode change 100644 => 100755 qa/L0_trt_dla/dla_test.py mode change 100644 => 100755 qa/L0_trt_dla/test.sh mode change 100644 => 100755 qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py create mode 100755 qa/L0_trt_error_propagation/test.sh create mode 100755 qa/L0_trt_error_propagation/trt_error_propagation_test.py mode change 100644 => 100755 qa/L0_trt_plugin/test.sh mode change 100644 => 100755 qa/L0_trt_plugin/trt_plugin_test.py mode change 100644 => 100755 qa/L0_trt_reformat_free/trt_reformat_free_test.py mode change 100644 => 100755 qa/L0_trt_shape_tensors/test.sh mode change 100644 => 100755 qa/L0_trt_shape_tensors/trt_shape_tensor_test.py mode change 100644 => 100755 qa/L0_vertex_ai/test.sh mode change 100644 => 100755 qa/L0_vertex_ai/vertex_ai_test.py mode change 100644 => 100755 qa/L0_warmup/test.sh create mode 100644 qa/common/gen_common.py mode change 100644 => 100755 qa/common/gen_ensemble_model_utils.py create mode 100755 qa/common/gen_jetson_trt_models mode change 100644 => 100755 qa/common/gen_qa_custom_ops_models.py mode change 100644 => 100755 qa/common/gen_qa_dyna_sequence_implicit_models.py mode change 100644 => 100755 qa/common/gen_qa_dyna_sequence_models.py mode change 100644 => 100755 qa/common/gen_qa_identity_models.py mode change 100644 => 100755 qa/common/gen_qa_implicit_models.py mode change 100644 => 100755 qa/common/gen_qa_models.py mode change 100644 => 100755 qa/common/gen_qa_noshape_models.py create mode 100755 qa/common/gen_qa_ort_scalar_models.py create mode 100644 qa/common/gen_qa_pytorch_model.py mode change 100644 => 100755 qa/common/gen_qa_ragged_models.py mode change 100644 => 100755 qa/common/gen_qa_reshape_models.py mode change 100644 => 100755 qa/common/gen_qa_sequence_models.py create mode 100755 qa/common/gen_qa_tf_parameters.py mode change 100644 => 100755 qa/common/gen_qa_torchtrt_models.py create mode 100755 qa/common/gen_qa_trt_data_dependent_shape.py mode change 100644 => 100755 qa/common/gen_qa_trt_format_models.py mode change 100644 => 100755 
qa/common/gen_qa_trt_plugin_models.py mode change 100644 => 100755 qa/common/gen_tag_sigdef.py delete mode 100755 qa/common/gen_xavier_trt_models create mode 100755 qa/common/infer_test.py mode change 100644 => 100755 qa/common/infer_util.py mode change 100644 => 100755 qa/common/inferentia_perf_analyzer_input_data_json/simple_model.py mode change 100644 => 100755 qa/common/libtorch_infer_client.py mode change 100644 => 100755 qa/common/nightly_email_helper.py create mode 100644 qa/common/perf_analyzer_input_data_json/int_data_optional.json create mode 100644 qa/common/perf_analyzer_input_data_json/repeat_int32_data.json mode change 100644 => 100755 qa/common/sequence_util.py mode change 100644 => 100755 qa/common/shm_util.py mode change 100644 => 100755 qa/common/test_util.py mode change 100755 => 100644 qa/custom_models/custom_zero_1_float32/config.pbtxt create mode 100644 qa/openvino_models/README.md create mode 100644 qa/openvino_models/dynamic_batch/1/model.bin create mode 100644 qa/openvino_models/dynamic_batch/1/model.mapping create mode 100644 qa/openvino_models/dynamic_batch/1/model.xml create mode 100644 qa/openvino_models/fixed_batch/1/model.bin create mode 100644 qa/openvino_models/fixed_batch/1/model.mapping create mode 100644 qa/openvino_models/fixed_batch/1/model.xml create mode 100644 qa/python_models/bls_finalize_error/config.pbtxt create mode 100644 qa/python_models/bls_finalize_error/model.py create mode 100644 qa/python_models/bls_init_error/config.pbtxt create mode 100644 qa/python_models/bls_init_error/model.py create mode 100644 qa/python_models/bls_model_loading/config.pbtxt create mode 100644 qa/python_models/bls_model_loading/model.py create mode 100644 qa/python_models/bls_onnx_warmup/config.pbtxt create mode 100644 qa/python_models/bls_onnx_warmup/model.py create mode 100644 qa/python_models/bls_parameters/config.pbtxt create mode 100644 qa/python_models/bls_parameters/model.py create mode 100644 qa/python_models/bls_request_rescheduling/config.pbtxt create mode 100644 qa/python_models/bls_request_rescheduling/model.py create mode 100644 qa/python_models/bls_simple/bls_simple.py create mode 100644 qa/python_models/bls_undefined/config.pbtxt create mode 100644 qa/python_models/bls_undefined/model.py create mode 100644 qa/python_models/cuda_memory_consumer/1/model.py create mode 100644 qa/python_models/cuda_memory_consumer/config.pbtxt create mode 100644 qa/python_models/custom_metrics/config.pbtxt create mode 100644 qa/python_models/custom_metrics/model.py create mode 100644 qa/python_models/dlpack_empty_output/config.pbtxt create mode 100644 qa/python_models/dlpack_empty_output/model.py create mode 100644 qa/python_models/dlpack_square/config.pbtxt create mode 100644 qa/python_models/dlpack_square/model.py create mode 100644 qa/python_models/error_code/config.pbtxt create mode 100644 qa/python_models/error_code/model.py create mode 100644 qa/python_models/execute_cancel/config.pbtxt create mode 100644 qa/python_models/execute_cancel/model.py create mode 100644 qa/python_models/ground_truth/config.pbtxt create mode 100644 qa/python_models/ground_truth/model.py create mode 100644 qa/python_models/identity_fp32_logging/config.pbtxt create mode 100644 qa/python_models/identity_fp32_logging/model.py create mode 100644 qa/python_models/identity_fp32_timeout/config.pbtxt create mode 100644 qa/python_models/identity_fp32_timeout/model.py create mode 100644 qa/python_models/init_exit/config.pbtxt create mode 100644 qa/python_models/init_exit/model.py create mode 100644 
qa/python_models/iterative_sequence/config.pbtxt create mode 100644 qa/python_models/iterative_sequence/model.py create mode 100644 qa/python_models/model_init_del/config.pbtxt create mode 100644 qa/python_models/model_init_del/model.py create mode 100755 qa/python_models/model_init_del/util.py mode change 100644 => 100755 qa/python_models/multi_file/file1.py mode change 100644 => 100755 qa/python_models/multi_file/file2.py create mode 100644 qa/python_models/python_based_backends/add_sub_backend/model.py create mode 100644 qa/python_models/request_rescheduling_addsub/config.pbtxt create mode 100644 qa/python_models/request_rescheduling_addsub/model.py create mode 100644 qa/python_models/sequence_int32/config.pbtxt create mode 100644 qa/python_models/sequence_int32/model.py rename deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-vs.yaml => qa/python_models/sequence_py/config.pbtxt (76%) create mode 100644 qa/python_models/sequence_py/model.py create mode 100644 qa/python_models/torchvision/resnet50/config.pbtxt create mode 100644 qa/python_models/torchvision/resnet50/model.py create mode 100644 qa/python_models/variable_gpu_output/config.pbtxt create mode 100644 qa/python_models/variable_gpu_output/model.py create mode 100644 qa/python_models/wrong_return_type/config.pbtxt create mode 100644 qa/python_models/wrong_return_type/model.py create mode 100644 src/command_line_parser.cc create mode 100644 src/command_line_parser.h create mode 100644 src/grpc/CMakeLists.txt create mode 100644 src/grpc/grpc_handler.h create mode 100644 src/grpc/grpc_server.cc create mode 100644 src/grpc/grpc_server.h create mode 100644 src/grpc/grpc_utils.cc create mode 100644 src/grpc/grpc_utils.h create mode 100644 src/grpc/infer_handler.cc create mode 100644 src/grpc/infer_handler.h create mode 100644 src/grpc/stream_infer_handler.cc create mode 100644 src/grpc/stream_infer_handler.h delete mode 100644 src/grpc_server.cc delete mode 100644 src/grpc_server.h create mode 100644 src/restricted_features.h create mode 100644 src/test/iterative_sequence/CMakeLists.txt create mode 100644 src/test/iterative_sequence/cmake/TritonIterativeSequenceBackendConfig.cmake.in create mode 100644 src/test/iterative_sequence/src/iterative_sequence.cc rename deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-gateway.yaml => src/test/iterative_sequence/src/libtriton_iterative_sequence.ldscript (81%) diff --git a/.clang-format b/.clang-format index 98c649734c..1defc175de 100644 --- a/.clang-format +++ b/.clang-format @@ -2,6 +2,7 @@ BasedOnStyle: Google IndentWidth: 2 +ColumnLimit: 80 ContinuationIndentWidth: 4 UseTab: Never MaxEmptyLinesToKeep: 2 @@ -34,4 +35,5 @@ BinPackArguments: true BinPackParameters: true ConstructorInitializerAllOnOneLineOrOnePerLine: false -IndentCaseLabels: true \ No newline at end of file +IndentCaseLabels: true + diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml new file mode 100644 index 0000000000..745a33730b --- /dev/null +++ b/.github/workflows/codeql.yml @@ -0,0 +1,84 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "CodeQL" + +on: + pull_request: + +jobs: + analyze: + name: Analyze + runs-on: ubuntu-latest + permissions: + actions: read + contents: read + security-events: write + + strategy: + fail-fast: false + matrix: + language: [ 'python' ] + # CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ] + # Learn more about CodeQL language support at https://aka.ms/codeql-docs/language-support + + steps: + - name: Checkout repository + uses: actions/checkout@v3 + + # Initializes the CodeQL tools for scanning. + - name: Initialize CodeQL + uses: github/codeql-action/init@v2 + with: + languages: ${{ matrix.language }} + # If you wish to specify custom queries, you can do so here or in a config file. + # By default, queries listed here will override any specified in a config file. + # Prefix the list here with "+" to use these queries and those in the config file. + + # For details on CodeQL's query packs, refer to: + # https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs + queries: +security-and-quality + + + # Autobuild attempts to build any compiled languages (C/C++, C#, Go, or Java). + # If this step fails, then you should remove it and run the build manually (see below) + - name: Autobuild + uses: github/codeql-action/autobuild@v2 + + # Command-line programs to run using the OS shell. + # See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun + + # If the Autobuild fails above, remove it and uncomment the following three lines. + # Modify them (or add more) to build your code; refer to the example below for guidance. + + # - run: | + # echo "Run, Build Application using script" + # ./location_of_script_within_repo/buildscript.sh + + - name: Perform CodeQL Analysis + uses: github/codeql-action/analyze@v2 + with: + category: "/language:${{matrix.language}}" diff --git a/.github/workflows/pre-commit.yaml b/.github/workflows/pre-commit.yaml new file mode 100644 index 0000000000..531cc2911b --- /dev/null +++ b/.github/workflows/pre-commit.yaml @@ -0,0 +1,39 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: pre-commit + +on: + pull_request: + +jobs: + pre-commit: + runs-on: ubuntu-22.04 + steps: + - uses: actions/checkout@v3 + - uses: actions/setup-python@v3 + - uses: pre-commit/action@v3.0.0 + diff --git a/.gitignore b/.gitignore index 523a31748f..f1b69cb25e 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,8 @@ +/build /builddir /.vscode *.so +__pycache__ +tmp +*.log +test_results.txt diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 0000000000..f44f815351 --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,74 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +repos: +- repo: https://github.com/timothycrosley/isort + rev: 5.12.0 + hooks: + - id: isort + additional_dependencies: [toml] +- repo: https://github.com/psf/black + rev: 23.1.0 + hooks: + - id: black + types_or: [python, cython] +- repo: https://github.com/PyCQA/flake8 + rev: 5.0.4 + hooks: + - id: flake8 + args: [--max-line-length=88, --select=C,E,F,W,B,B950, --extend-ignore = E203,E501] + types_or: [python, cython] +- repo: https://github.com/pre-commit/mirrors-clang-format + rev: v16.0.5 + hooks: + - id: clang-format + types_or: [c, c++, cuda, proto, textproto, java] + args: ["-fallback-style=none", "-style=file", "-i"] +- repo: https://github.com/codespell-project/codespell + rev: v2.2.4 + hooks: + - id: codespell + additional_dependencies: [tomli] + args: ["--toml", "pyproject.toml"] + exclude: (?x)^(.*stemmer.*|.*stop_words.*|^CHANGELOG.md$) +# More details about these pre-commit hooks here: +# https://pre-commit.com/hooks.html +- repo: https://github.com/pre-commit/pre-commit-hooks + rev: v4.4.0 + hooks: + - id: check-case-conflict + - id: check-executables-have-shebangs + - id: check-merge-conflict + - id: check-json + - id: check-toml + - id: check-yaml + exclude: ^deploy(\/[^\/]+)*\/templates\/.*$ + - id: check-shebang-scripts-are-executable + - id: end-of-file-fixer + types_or: [c, c++, cuda, proto, textproto, java, python] + - id: mixed-line-ending + - id: requirements-txt-fixer + - id: trailing-whitespace diff --git a/CITATION.cff b/CITATION.cff new file mode 100644 index 0000000000..f8fb8d09fb --- /dev/null +++ b/CITATION.cff @@ -0,0 +1,7 @@ +cff-version: 1.2.0 +message: "If you use this software, please cite it as below." +title: "Triton Inference Server: An Optimized Cloud and Edge Inferencing Solution." +url: https://github.com/triton-inference-server +repository-code: https://github.com/triton-inference-server/server +authors: + - name: "NVIDIA Corporation" diff --git a/CMakeLists.txt b/CMakeLists.txt index 6d4ec543df..13dc0c4e9b 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -1,4 +1,4 @@ -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -38,6 +38,7 @@ option(TRITON_ENABLE_TRACING "Include tracing support in server" OFF) option(TRITON_ENABLE_NVTX "Include NVTX support in server" OFF) option(TRITON_ENABLE_GPU "Enable GPU support in server" ON) option(TRITON_ENABLE_MALI_GPU "Enable Arm Mali GPU support in server" OFF) +option(TRITON_IGPU_BUILD "Enable options for iGPU compilation in server" OFF) set(TRITON_MIN_COMPUTE_CAPABILITY "6.0" CACHE STRING "The minimum CUDA compute capability supported by Triton" ) set(TRITON_EXTRA_LIB_PATHS "" CACHE PATH "Extra library paths for Triton Server build") @@ -54,6 +55,7 @@ option(TRITON_ENABLE_VERTEX_AI "Include Vertex AI API in server" OFF) # Metrics option(TRITON_ENABLE_METRICS "Include metrics support in server" ON) option(TRITON_ENABLE_METRICS_GPU "Include GPU metrics support in server" ON) +option(TRITON_ENABLE_METRICS_CPU "Include CPU metrics support in server" ON) # Cloud storage option(TRITON_ENABLE_GCS "Include GCS Filesystem support in server" OFF) @@ -85,6 +87,10 @@ if(TRITON_ENABLE_TRACING AND NOT TRITON_ENABLE_STATS) message(FATAL_ERROR "TRITON_ENABLE_TRACING=ON requires TRITON_ENABLE_STATS=ON") endif() +if (TRITON_ENABLE_METRICS_CPU AND NOT TRITON_ENABLE_METRICS) + message(FATAL_ERROR "TRITON_ENABLE_METRICS_CPU=ON requires TRITON_ENABLE_METRICS=ON") +endif() + if (TRITON_ENABLE_METRICS_GPU AND NOT TRITON_ENABLE_METRICS) message(FATAL_ERROR "TRITON_ENABLE_METRICS_GPU=ON requires TRITON_ENABLE_METRICS=ON") endif() @@ -113,6 +119,19 @@ FetchContent_Declare( GIT_TAG ${TRITON_THIRD_PARTY_REPO_TAG} ) +# Some libs are installed to ${TRITON_THIRD_PARTY_INSTALL_PREFIX}/{LIB}/lib64 instead +# of ${TRITON_THIRD_PARTY_INSTALL_PREFIX}/{LIB}/lib on CentOS +set (LIB_DIR "lib") +# /etc/os-release does not exist on Windows +if(EXISTS "/etc/os-release") + file(STRINGS /etc/os-release DISTRO REGEX "^NAME=") + string(REGEX REPLACE "NAME=\"(.*)\"" "\\1" DISTRO "${DISTRO}") + message(STATUS "Distro Name: ${DISTRO}") + if(DISTRO MATCHES "CentOS.*") + set (LIB_DIR "lib64") + endif() +endif() + set(TRITON_CORE_HEADERS_ONLY OFF) FetchContent_MakeAvailable(repo-third-party repo-core) @@ -152,7 +171,16 @@ endif() if (WIN32) set(_FINDPACKAGE_PROTOBUF_CONFIG_DIR "${TRITON_THIRD_PARTY_INSTALL_PREFIX}/protobuf/cmake") else() - set(_FINDPACKAGE_PROTOBUF_CONFIG_DIR "${TRITON_THIRD_PARTY_INSTALL_PREFIX}/protobuf/lib/cmake/protobuf") + set(_FINDPACKAGE_PROTOBUF_CONFIG_DIR "${TRITON_THIRD_PARTY_INSTALL_PREFIX}/protobuf/${LIB_DIR}/cmake/protobuf") +endif() + +# Triton with OpenTelemetry is not supported on Windows +# FIXME: add location for Windows, when support is added +# JIRA DLIS-4786 +if (WIN32) + set(_FINDPACKAGE_OPENTELEMETRY_CONFIG_DIR "") +else() + set(_FINDPACKAGE_OPENTELEMETRY_CONFIG_DIR "${TRITON_THIRD_PARTY_INSTALL_PREFIX}/opentelemetry-cpp/${LIB_DIR}/cmake/opentelemetry-cpp") endif() if (CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT) @@ -168,15 +196,15 @@ endif() # TRITON_ENABLE_GCS if(${TRITON_ENABLE_S3}) set(TRITON_DEPENDS ${TRITON_DEPENDS} aws-sdk-cpp) endif() # TRITON_ENABLE_S3 -if(${TRITON_ENABLE_AZURE_STORAGE}) - set(TRITON_DEPENDS ${TRITON_DEPENDS} azure-storage-cpplite) -endif() # TRITON_ENABLE_AZURE_STORAGE if(${TRITON_ENABLE_HTTP} OR ${TRITON_ENABLE_METRICS} OR ${TRITON_ENABLE_SAGEMAKER} OR ${TRITON_ENABLE_VERTEX_AI}) set(TRITON_DEPENDS ${TRITON_DEPENDS} libevent libevhtp) endif() # TRITON_ENABLE_HTTP || TRITON_ENABLE_METRICS || 
TRITON_ENABLE_SAGEMAKER || TRITON_ENABLE_VERTEX_AI if(${TRITON_ENABLE_GRPC}) set(TRITON_DEPENDS ${TRITON_DEPENDS} grpc) endif() # TRITON_ENABLE_GRPC +if(NOT WIN32 AND ${TRITON_ENABLE_TRACING}) + set(TRITON_DEPENDS ${TRITON_DEPENDS} opentelemetry-cpp) +endif() # TRITON_ENABLE_TRACING ExternalProject_Add(triton-server PREFIX triton-server @@ -189,21 +217,23 @@ ExternalProject_Add(triton-server ${_CMAKE_ARGS_VCPKG_TARGET_TRIPLET} -DGTEST_ROOT:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/googletest -DgRPC_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/grpc/lib/cmake/grpc - -Dc-ares_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/c-ares/lib/cmake/c-ares - -Dabsl_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/absl/lib/cmake/absl - -Dnlohmann_json_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/nlohmann_json/lib/cmake/nlohmann_json + -Dc-ares_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/c-ares/${LIB_DIR}/cmake/c-ares + -Dabsl_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/absl/${LIB_DIR}/cmake/absl + -DCURL_DIR:STRING=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/curl/${LIB_DIR}/cmake/CURL + -Dnlohmann_json_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/nlohmann_json/${LIB_DIR}/cmake/nlohmann_json -DLibevent_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/libevent/lib/cmake/libevent -Dlibevhtp_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/libevhtp/lib/cmake/libevhtp - -Dstorage_client_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/google-cloud-cpp/lib/cmake/storage_client - -Dazure-storage-cpplite_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/azure-storage-cpplite - -Dgoogle_cloud_cpp_common_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/google-cloud-cpp/lib/cmake/google_cloud_cpp_common - -DCrc32c_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/crc32c/lib/cmake/Crc32c - -DAWSSDK_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/cmake/AWSSDK - -Daws-cpp-sdk-core_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/cmake/aws-cpp-sdk-core - -Daws-cpp-sdk-s3_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/cmake/aws-cpp-sdk-s3 - -Daws-c-event-stream_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/aws-c-event-stream/cmake - -Daws-c-common_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/aws-c-common/cmake - -Daws-checksums_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/lib/aws-checksums/cmake + -Dstorage_client_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/google-cloud-cpp/${LIB_DIR}/cmake/storage_client + -Dgoogle_cloud_cpp_common_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/google-cloud-cpp/${LIB_DIR}/cmake/google_cloud_cpp_common + -DCrc32c_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/crc32c/${LIB_DIR}/cmake/Crc32c + -DAWSSDK_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/cmake/AWSSDK + -Daws-cpp-sdk-core_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/cmake/aws-cpp-sdk-core + -Daws-cpp-sdk-s3_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/cmake/aws-cpp-sdk-s3 + -Daws-c-event-stream_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/aws-c-event-stream/cmake + -Daws-c-common_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/aws-c-common/cmake + -Daws-checksums_DIR:PATH=${TRITON_THIRD_PARTY_INSTALL_PREFIX}/aws-sdk-cpp/${LIB_DIR}/aws-checksums/cmake + -Dopentelemetry-cpp_DIR:PATH=${_FINDPACKAGE_OPENTELEMETRY_CONFIG_DIR} + -DTRITON_IGPU_BUILD:BOOL=${TRITON_IGPU_BUILD} -DTRITON_THIRD_PARTY_REPO_TAG:STRING=${TRITON_THIRD_PARTY_REPO_TAG} 
-DTRITON_COMMON_REPO_TAG:STRING=${TRITON_COMMON_REPO_TAG} -DTRITON_CORE_REPO_TAG:STRING=${TRITON_CORE_REPO_TAG} @@ -223,6 +253,7 @@ ExternalProject_Add(triton-server -DTRITON_MIN_COMPUTE_CAPABILITY:STRING=${TRITON_MIN_COMPUTE_CAPABILITY} -DTRITON_ENABLE_METRICS:BOOL=${TRITON_ENABLE_METRICS} -DTRITON_ENABLE_METRICS_GPU:BOOL=${TRITON_ENABLE_METRICS_GPU} + -DTRITON_ENABLE_METRICS_CPU:BOOL=${TRITON_ENABLE_METRICS_CPU} -DTRITON_ENABLE_GCS:BOOL=${TRITON_ENABLE_GCS} -DTRITON_ENABLE_AZURE_STORAGE:BOOL=${TRITON_ENABLE_AZURE_STORAGE} -DTRITON_ENABLE_S3:BOOL=${TRITON_ENABLE_S3} diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index dbc3f9bdb4..59e0ace975 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,5 +1,5 @@ + +# Report a Security Vulnerability + +To report a potential security vulnerability in any NVIDIA product, please use either: +* This web form: [Security Vulnerability Submission Form](https://www.nvidia.com/object/submit-security-vulnerability.html), or +* Send email to: [NVIDIA PSIRT](mailto:psirt@nvidia.com) + +**OEM Partners should contact their NVIDIA Customer Program Manager** + +If reporting a potential vulnerability via email, please encrypt it using NVIDIA’s public PGP key ([see PGP Key page](https://www.nvidia.com/en-us/security/pgp-key/)) and include the following information: +1. Product/Driver name and version/branch that contains the vulnerability +2. Type of vulnerability (code execution, denial of service, buffer overflow, etc.) +3. Instructions to reproduce the vulnerability +4. Proof-of-concept or exploit code +5. Potential impact of the vulnerability, including how an attacker could exploit the vulnerability + +See https://www.nvidia.com/en-us/security/ for past NVIDIA Security Bulletins and Notices. diff --git a/TRITON_VERSION b/TRITON_VERSION index 7609fc9e9e..25aa01454a 100644 --- a/TRITON_VERSION +++ b/TRITON_VERSION @@ -1 +1 @@ -2.24.0dev +2.42.0dev diff --git a/build.py b/build.py index 0e808b591f..9b61dd5182 100755 --- a/build.py +++ b/build.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,26 +26,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import logging +import importlib.util +import multiprocessing import os import os.path -import multiprocessing import pathlib import platform -import shutil import stat import subprocess import sys -import traceback from inspect import getsourcefile +import requests + # # Build Triton Inference Server. # # By default build.py builds the Triton Docker image, but can also be # used to build without Docker. See docs/build.md and --help for more -# infomation. +# information. # # The TRITON_VERSION file indicates the Triton version and # TRITON_VERSION_MAP is used to determine the corresponding container @@ -69,41 +69,20 @@ # different versions are used then one backend or the other will # incorrectly load the other version of the openvino libraries. # -# The standalone openVINO describes multiple versions where each version -# is a pair of openVINO version and openVINO package version. When openVINO -# package version is specified, then backend will be built with pre-built -# openVINO release from Intel. 
If the package version is specified as None, -# then openVINO for the backend is built from source with openMP support. -# By default, only the first version is built. To build the all the versions -# in list use --build-multiple-openvino. Triton will use the first version -# for inference by default. In order to use different version, Triton should -# be invoked with appropriate backend configuration: -# (--backend-config=openvino,version=) -# The version string can be obtained as follows: -# _[_pre] -# Append '_pre' only if the openVINO backend was built with prebuilt openVINO -# library. In other words, when the second element of the pair is not None. -# To use ('2021.4', None) version_str should be `2021_4'. -# To use ('2021.4', '2021.4.582') version_str should be `2021_4_pre'. -# User can also build openvino backend from specific commit sha of openVINO -# repository. The pair should be (`SPECIFIC`, ). -# Note: Not all sha ids would successfuly compile and work. -# Note: When updating the conda version, make sure to update the shasum of -# the packages used for different platforms in install_miniconda function. -# TRITON_VERSION_MAP = { - '2.24.0dev': ( - '22.07dev', # triton container - '22.05', # upstream container - '1.11.1', # ORT - '2021.4.582', # ORT OpenVINO - (('2021.4', None), ('2021.4', '2021.4.582'), - ('SPECIFIC', 'f2f281e6')), # Standalone OpenVINO - '2.2.9', # DCGM version - 'py38_4.12.0') # Conda version. + "2.42.0dev": ( + "24.01dev", # triton container + "23.11", # upstream container + "1.16.3", # ORT + "2023.0.0", # ORT OpenVINO + "2023.0.0", # Standalone OpenVINO + "3.2.6", # DCGM version + "py310_23.1.0-1", # Conda version + "0.2.2", # vLLM version + ) } -CORE_BACKENDS = ['ensemble'] +CORE_BACKENDS = ["ensemble"] FLAGS = None EXTRA_CORE_CMAKE_FLAGS = {} @@ -119,7 +98,7 @@ def log(msg, force=False): try: print(msg, file=sys.stderr) except Exception: - print('', file=sys.stderr) + print("", file=sys.stderr) def log_verbose(msg): @@ -133,7 +112,7 @@ def fail(msg): def fail_if(p, msg): if p: - print('error: {}'.format(msg), file=sys.stderr) + print("error: {}".format(msg), file=sys.stderr) sys.exit(1) @@ -149,26 +128,14 @@ def target_machine(): return platform.machine().lower() -def tagged_backend(be, version): - tagged_be = be - if be == 'openvino': - if version[0] == 'SPECIFIC': - tagged_be += "_" + version[1] - else: - tagged_be += "_" + version[0].replace('.', '_') - if version[1] and target_platform() != 'windows': - tagged_be += "_pre" - return tagged_be - - def container_versions(version, container_version, upstream_container_version): if container_version is None: if version not in TRITON_VERSION_MAP: - fail('container version not known for {}'.format(version)) + fail("container version not known for {}".format(version)) container_version = TRITON_VERSION_MAP[version][0] if upstream_container_version is None: if version not in TRITON_VERSION_MAP: - fail('upstream container version not known for {}'.format(version)) + fail("upstream container version not known for {}".format(version)) upstream_container_version = TRITON_VERSION_MAP[version][1] return container_version, upstream_container_version @@ -193,13 +160,13 @@ def __del__(self): def close(self): if self._file is not None: - if target_platform() == 'windows': + if target_platform() == "windows": self.blankln() - self._file.write('}\n') - self._file.write('catch {\n') - self._file.write(' $_;\n') - self._file.write(' ExitWithCode 1;\n') - self._file.write('}\n') + self._file.write("}\n") + 
self._file.write("catch {\n") + self._file.write(" $_;\n") + self._file.write(" ExitWithCode 1;\n") + self._file.write("}\n") """Close the file""" self._file.close() self._file = None @@ -207,28 +174,28 @@ def close(self): os.chmod(self._filepath, st.st_mode | stat.S_IEXEC) def blankln(self): - self._file.write('\n') + self._file.write("\n") def commentln(self, cnt): - self._file.write('#' * cnt + '\n') + self._file.write("#" * cnt + "\n") - def comment(self, msg=''): + def comment(self, msg=""): if not isinstance(msg, str): try: for m in msg: - self._file.write(f'# {msg}\n') + self._file.write(f"# {msg}\n") return except TypeError: pass - self._file.write(f'# {msg}\n') + self._file.write(f"# {msg}\n") - def comment_verbose(self, msg=''): + def comment_verbose(self, msg=""): if self._verbose: self.comment(msg) def header(self, desc=None): - if target_platform() != 'windows': - self._file.write('#!/usr/bin/env bash\n\n') + if target_platform() != "windows": + self._file.write("#!/usr/bin/env bash\n\n") if desc is not None: self.comment() @@ -236,132 +203,134 @@ def header(self, desc=None): self.comment() self.blankln() - self.comment('Exit script immediately if any command fails') - if target_platform() == 'windows': - self._file.write('function ExitWithCode($exitcode) {\n') - self._file.write(' $host.SetShouldExit($exitcode)\n') - self._file.write(' exit $exitcode\n') - self._file.write('}\n') + self.comment("Exit script immediately if any command fails") + if target_platform() == "windows": + self._file.write("function ExitWithCode($exitcode) {\n") + self._file.write(" $host.SetShouldExit($exitcode)\n") + self._file.write(" exit $exitcode\n") + self._file.write("}\n") self.blankln() if self._verbose: - self._file.write('Set-PSDebug -Trace 1\n') + self._file.write("Set-PSDebug -Trace 1\n") self.blankln() - self._file.write('try {\n') + self._file.write("try {\n") else: - self._file.write('set -e\n') + self._file.write("set -e\n") if self._verbose: - self._file.write('set -x\n') + self._file.write("set -x\n") self.blankln() def envvar_ref(self, v): - if target_platform() == 'windows': - return f'${{env:{v}}}' - return f'${{{v}}}' + if target_platform() == "windows": + return f"${{env:{v}}}" + return f"${{{v}}}" def cmd(self, clist, check_exitcode=False): if isinstance(clist, str): - self._file.write(f'{clist}\n') + self._file.write(f"{clist}\n") else: for c in clist: - self._file.write(f'{c} ') + self._file.write(f"{c} ") self.blankln() if check_exitcode: - if target_platform() == 'windows': - self._file.write('if ($LASTEXITCODE -ne 0) {\n') + if target_platform() == "windows": + self._file.write("if ($LASTEXITCODE -ne 0) {\n") self._file.write( - ' Write-Output "exited with status code $LASTEXITCODE";\n') - self._file.write(' ExitWithCode 1;\n') - self._file.write('}\n') + ' Write-Output "exited with status code $LASTEXITCODE";\n' + ) + self._file.write(" ExitWithCode 1;\n") + self._file.write("}\n") def cwd(self, path): - if target_platform() == 'windows': - self.cmd(f'Set-Location -EV Err -EA Stop {path}') + if target_platform() == "windows": + self.cmd(f"Set-Location -EV Err -EA Stop {path}") else: - self.cmd(f'cd {path}') + self.cmd(f"cd {path}") def cp(self, src, dest): - if target_platform() == 'windows': - self.cmd(f'Copy-Item -EV Err -EA Stop {src} -Destination {dest}') + if target_platform() == "windows": + self.cmd(f"Copy-Item -EV Err -EA Stop {src} -Destination {dest}") else: - self.cmd(f'cp {src} {dest}') + self.cmd(f"cp {src} {dest}") def mkdir(self, path): - if 
target_platform() == 'windows': + if target_platform() == "windows": self.cmd( - f'New-Item -EV Err -EA Stop -ItemType Directory -Force -Path {path}' + f"New-Item -EV Err -EA Stop -ItemType Directory -Force -Path {path}" ) else: - self.cmd(f'mkdir -p {pathlib.Path(path)}') + self.cmd(f"mkdir -p {pathlib.Path(path)}") def rmdir(self, path): - if target_platform() == 'windows': - self.cmd(f'if (Test-Path -Path {path}) {{') - self.cmd(f' Remove-Item -EV Err -EA Stop -Recurse -Force {path}') - self.cmd('}') + if target_platform() == "windows": + self.cmd(f"if (Test-Path -Path {path}) {{") + self.cmd(f" Remove-Item -EV Err -EA Stop -Recurse -Force {path}") + self.cmd("}") else: - self.cmd(f'rm -fr {pathlib.Path(path)}') + self.cmd(f"rm -fr {pathlib.Path(path)}") def cpdir(self, src, dest): - if target_platform() == 'windows': - self.cmd( - f'Copy-Item -EV Err -EA Stop -Recurse {src} -Destination {dest}' - ) + if target_platform() == "windows": + self.cmd(f"Copy-Item -EV Err -EA Stop -Recurse {src} -Destination {dest}") else: - self.cmd(f'cp -r {src} {dest}') + self.cmd(f"cp -r {src} {dest}") def tar(self, subdir, tar_filename): - if target_platform() == 'windows': - fail('unsupported operation: tar') + if target_platform() == "windows": + fail("unsupported operation: tar") else: - self.cmd(f'tar zcf {tar_filename} {subdir}') + self.cmd(f"tar zcf {tar_filename} {subdir}") def cmake(self, args): # Pass some additional envvars into cmake... env_args = [] - for k in ('TRT_VERSION', 'DALI_VERSION', 'CMAKE_TOOLCHAIN_FILE', - 'VCPKG_TARGET_TRIPLET'): + for k in ("TRT_VERSION", "CMAKE_TOOLCHAIN_FILE", "VCPKG_TARGET_TRIPLET"): env_args += [f'"-D{k}={self.envvar_ref(k)}"'] - self.cmd(f'cmake {" ".join(env_args)} {" ".join(args)}', - check_exitcode=True) + self.cmd(f'cmake {" ".join(env_args)} {" ".join(args)}', check_exitcode=True) - def makeinstall(self, target='install'): - if target_platform() == 'windows': - verbose_flag = '' if self._verbose else '-clp:ErrorsOnly' + def makeinstall(self, target="install"): + if target_platform() == "windows": + verbose_flag = "" if self._verbose else "-clp:ErrorsOnly" self.cmd( - f'msbuild.exe -m:{FLAGS.build_parallel} {verbose_flag} -p:Configuration={FLAGS.build_type} {target}.vcxproj', - check_exitcode=True) + f"msbuild.exe -m:{FLAGS.build_parallel} {verbose_flag} -p:Configuration={FLAGS.build_type} {target}.vcxproj", + check_exitcode=True, + ) else: - verbose_flag = 'VERBOSE=1' if self._verbose else 'VERBOSE=0' - self.cmd(f'make -j{FLAGS.build_parallel} {verbose_flag} {target}') + verbose_flag = "VERBOSE=1" if self._verbose else "VERBOSE=0" + self.cmd(f"make -j{FLAGS.build_parallel} {verbose_flag} {target}") def gitclone(self, repo, tag, subdir, org): clone_dir = subdir if not FLAGS.no_force_clone: self.rmdir(clone_dir) - if target_platform() == 'windows': - self.cmd(f'if (-Not (Test-Path -Path {clone_dir})) {{') + if target_platform() == "windows": + self.cmd(f"if (-Not (Test-Path -Path {clone_dir})) {{") else: - self.cmd(f'if [[ ! -e {clone_dir} ]]; then') + self.cmd(f"if [[ ! -e {clone_dir} ]]; then") + # FIXME [DLIS-4045 - Currently the tag starting with "pull/" is not + # working with "--repo-tag" as the option is not forwarded to the + # individual repo build correctly.] # If 'tag' starts with "pull/" then it must be of form # "pull//head". We just clone at "main" and then fetch the # reference onto a new branch we name "tritonbuildref". 
if tag.startswith("pull/"): self.cmd( - f' git clone --recursive --depth=1 {org}/{repo}.git {subdir};', - check_exitcode=True) - self.cmd('}' if target_platform() == 'windows' else 'fi') + f" git clone --recursive --depth=1 {org}/{repo}.git {subdir};", + check_exitcode=True, + ) + self.cmd("}" if target_platform() == "windows" else "fi") self.cwd(subdir) - self.cmd(f'git fetch origin {tag}:tritonbuildref', - check_exitcode=True) - self.cmd(f'git checkout tritonbuildref', check_exitcode=True) + self.cmd(f"git fetch origin {tag}:tritonbuildref", check_exitcode=True) + self.cmd(f"git checkout tritonbuildref", check_exitcode=True) else: self.cmd( - f' git clone --recursive --single-branch --depth=1 -b {tag} {org}/{repo}.git {subdir};', - check_exitcode=True) - self.cmd('}' if target_platform() == 'windows' else 'fi') + f" git clone --recursive --single-branch --depth=1 -b {tag} {org}/{repo}.git {subdir};", + check_exitcode=True, + ) + self.cmd("}" if target_platform() == "windows" else "fi") def cmake_core_arg(name, type, value): @@ -370,9 +339,9 @@ def cmake_core_arg(name, type, value): if name in OVERRIDE_CORE_CMAKE_FLAGS: value = OVERRIDE_CORE_CMAKE_FLAGS[name] if type is None: - type = '' + type = "" else: - type = ':{}'.format(type) + type = ":{}".format(type) return '"-D{}{}={}"'.format(name, type, value) @@ -383,7 +352,7 @@ def cmake_core_enable(name, flag): if name in OVERRIDE_CORE_CMAKE_FLAGS: value = OVERRIDE_CORE_CMAKE_FLAGS[name] else: - value = 'ON' if flag else 'OFF' + value = "ON" if flag else "OFF" return '"-D{}:BOOL={}"'.format(name, value) @@ -401,9 +370,9 @@ def cmake_backend_arg(backend, name, type, value): if name in OVERRIDE_BACKEND_CMAKE_FLAGS[backend]: value = OVERRIDE_BACKEND_CMAKE_FLAGS[backend][name] if type is None: - type = '' + type = "" else: - type = ':{}'.format(type) + type = ":{}".format(type) return '"-D{}{}={}"'.format(name, type, value) @@ -416,7 +385,7 @@ def cmake_backend_enable(backend, name, flag): if name in OVERRIDE_BACKEND_CMAKE_FLAGS[backend]: value = OVERRIDE_BACKEND_CMAKE_FLAGS[backend][name] if value is None: - value = 'ON' if flag else 'OFF' + value = "ON" if flag else "OFF" return '"-D{}:BOOL={}"'.format(name, value) @@ -431,15 +400,15 @@ def cmake_backend_extra_args(backend): def cmake_repoagent_arg(name, type, value): # For now there is no override for repo-agents if type is None: - type = '' + type = "" else: - type = ':{}'.format(type) + type = ":{}".format(type) return '"-D{}{}={}"'.format(name, type, value) def cmake_repoagent_enable(name, flag): # For now there is no override for repo-agents - value = 'ON' if flag else 'OFF' + value = "ON" if flag else "OFF" return '"-D{}:BOOL={}"'.format(name, value) @@ -449,63 +418,80 @@ def cmake_repoagent_extra_args(): return args +def cmake_cache_arg(name, type, value): + # For now there is no override for caches + if type is None: + type = "" + else: + type = ":{}".format(type) + return '"-D{}{}={}"'.format(name, type, value) + + +def cmake_cache_enable(name, flag): + # For now there is no override for caches + value = "ON" if flag else "OFF" + return '"-D{}:BOOL={}"'.format(name, value) + + +def cmake_cache_extra_args(): + # For now there is no extra args for caches + args = [] + return args + + def core_cmake_args(components, backends, cmake_dir, install_dir): cargs = [ - cmake_core_arg('CMAKE_BUILD_TYPE', None, FLAGS.build_type), - cmake_core_arg('CMAKE_INSTALL_PREFIX', 'PATH', install_dir), - cmake_core_arg('TRITON_VERSION', 'STRING', FLAGS.version), - cmake_core_arg('TRITON_COMMON_REPO_TAG', 
'STRING', - components['common']), - cmake_core_arg('TRITON_CORE_REPO_TAG', 'STRING', components['core']), - cmake_core_arg('TRITON_BACKEND_REPO_TAG', 'STRING', - components['backend']), - cmake_core_arg('TRITON_THIRD_PARTY_REPO_TAG', 'STRING', - components['thirdparty']) + cmake_core_arg("CMAKE_BUILD_TYPE", None, FLAGS.build_type), + cmake_core_arg("CMAKE_INSTALL_PREFIX", "PATH", install_dir), + cmake_core_arg("TRITON_VERSION", "STRING", FLAGS.version), + cmake_core_arg("TRITON_COMMON_REPO_TAG", "STRING", components["common"]), + cmake_core_arg("TRITON_CORE_REPO_TAG", "STRING", components["core"]), + cmake_core_arg("TRITON_BACKEND_REPO_TAG", "STRING", components["backend"]), + cmake_core_arg( + "TRITON_THIRD_PARTY_REPO_TAG", "STRING", components["thirdparty"] + ), ] + cargs.append(cmake_core_enable("TRITON_ENABLE_LOGGING", FLAGS.enable_logging)) + cargs.append(cmake_core_enable("TRITON_ENABLE_STATS", FLAGS.enable_stats)) + cargs.append(cmake_core_enable("TRITON_ENABLE_METRICS", FLAGS.enable_metrics)) cargs.append( - cmake_core_enable('TRITON_ENABLE_LOGGING', FLAGS.enable_logging)) - cargs.append(cmake_core_enable('TRITON_ENABLE_STATS', FLAGS.enable_stats)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_METRICS', FLAGS.enable_metrics)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_METRICS_GPU', - FLAGS.enable_gpu_metrics)) + cmake_core_enable("TRITON_ENABLE_METRICS_GPU", FLAGS.enable_gpu_metrics) + ) cargs.append( - cmake_core_enable('TRITON_ENABLE_TRACING', FLAGS.enable_tracing)) - cargs.append(cmake_core_enable('TRITON_ENABLE_NVTX', FLAGS.enable_nvtx)) + cmake_core_enable("TRITON_ENABLE_METRICS_CPU", FLAGS.enable_cpu_metrics) + ) + cargs.append(cmake_core_enable("TRITON_ENABLE_TRACING", FLAGS.enable_tracing)) + cargs.append(cmake_core_enable("TRITON_ENABLE_NVTX", FLAGS.enable_nvtx)) - cargs.append(cmake_core_enable('TRITON_ENABLE_GPU', FLAGS.enable_gpu)) + cargs.append(cmake_core_enable("TRITON_ENABLE_GPU", FLAGS.enable_gpu)) cargs.append( - cmake_core_arg('TRITON_MIN_COMPUTE_CAPABILITY', None, - FLAGS.min_compute_capability)) + cmake_core_arg( + "TRITON_MIN_COMPUTE_CAPABILITY", None, FLAGS.min_compute_capability + ) + ) - cargs.append( - cmake_core_enable('TRITON_ENABLE_MALI_GPU', FLAGS.enable_mali_gpu)) + cargs.append(cmake_core_enable("TRITON_ENABLE_MALI_GPU", FLAGS.enable_mali_gpu)) + cargs.append(cmake_core_enable("TRITON_ENABLE_GRPC", "grpc" in FLAGS.endpoint)) + cargs.append(cmake_core_enable("TRITON_ENABLE_HTTP", "http" in FLAGS.endpoint)) cargs.append( - cmake_core_enable('TRITON_ENABLE_GRPC', 'grpc' in FLAGS.endpoint)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_HTTP', 'http' in FLAGS.endpoint)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_SAGEMAKER', 'sagemaker' - in FLAGS.endpoint)) + cmake_core_enable("TRITON_ENABLE_SAGEMAKER", "sagemaker" in FLAGS.endpoint) + ) cargs.append( - cmake_core_enable('TRITON_ENABLE_VERTEX_AI', 'vertex-ai' - in FLAGS.endpoint)) + cmake_core_enable("TRITON_ENABLE_VERTEX_AI", "vertex-ai" in FLAGS.endpoint) + ) + cargs.append(cmake_core_enable("TRITON_ENABLE_GCS", "gcs" in FLAGS.filesystem)) + cargs.append(cmake_core_enable("TRITON_ENABLE_S3", "s3" in FLAGS.filesystem)) cargs.append( - cmake_core_enable('TRITON_ENABLE_GCS', 'gcs' in FLAGS.filesystem)) - cargs.append(cmake_core_enable('TRITON_ENABLE_S3', 's3' - in FLAGS.filesystem)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_AZURE_STORAGE', 'azure_storage' - in FLAGS.filesystem)) + cmake_core_enable( + "TRITON_ENABLE_AZURE_STORAGE", "azure_storage" in FLAGS.filesystem + ) 
+ ) - cargs.append( - cmake_core_enable('TRITON_ENABLE_ENSEMBLE', 'ensemble' in backends)) - cargs.append( - cmake_core_enable('TRITON_ENABLE_TENSORRT', 'tensorrt' in backends)) + cargs.append(cmake_core_enable("TRITON_ENABLE_ENSEMBLE", "ensemble" in backends)) + cargs.append(cmake_core_enable("TRITON_ENABLE_TENSORRT", "tensorrt" in backends)) cargs += cmake_core_extra_args() cargs.append(cmake_dir) @@ -513,346 +499,391 @@ def core_cmake_args(components, backends, cmake_dir, install_dir): def repoagent_repo(ra): - return '{}_repository_agent'.format(ra) + return "{}_repository_agent".format(ra) def repoagent_cmake_args(images, components, ra, install_dir): args = [] cargs = args + [ - cmake_repoagent_arg('CMAKE_BUILD_TYPE', None, FLAGS.build_type), - cmake_repoagent_arg('CMAKE_INSTALL_PREFIX', 'PATH', install_dir), - cmake_repoagent_arg('TRITON_COMMON_REPO_TAG', 'STRING', - components['common']), - cmake_repoagent_arg('TRITON_CORE_REPO_TAG', 'STRING', - components['core']) + cmake_repoagent_arg("CMAKE_BUILD_TYPE", None, FLAGS.build_type), + cmake_repoagent_arg("CMAKE_INSTALL_PREFIX", "PATH", install_dir), + cmake_repoagent_arg("TRITON_COMMON_REPO_TAG", "STRING", components["common"]), + cmake_repoagent_arg("TRITON_CORE_REPO_TAG", "STRING", components["core"]), ] - cargs.append(cmake_repoagent_enable('TRITON_ENABLE_GPU', FLAGS.enable_gpu)) + cargs.append(cmake_repoagent_enable("TRITON_ENABLE_GPU", FLAGS.enable_gpu)) cargs += cmake_repoagent_extra_args() - cargs.append('..') + cargs.append("..") + return cargs + + +def cache_repo(cache): + # example: "local", or "redis" + return "{}_cache".format(cache) + + +def cache_cmake_args(images, components, cache, install_dir): + args = [] + + cargs = args + [ + cmake_cache_arg("CMAKE_BUILD_TYPE", None, FLAGS.build_type), + cmake_cache_arg("CMAKE_INSTALL_PREFIX", "PATH", install_dir), + cmake_cache_arg("TRITON_COMMON_REPO_TAG", "STRING", components["common"]), + cmake_cache_arg("TRITON_CORE_REPO_TAG", "STRING", components["core"]), + ] + + cargs.append(cmake_cache_enable("TRITON_ENABLE_GPU", FLAGS.enable_gpu)) + cargs += cmake_cache_extra_args() + cargs.append("..") return cargs def backend_repo(be): - if (be == 'tensorflow1') or (be == 'tensorflow2'): - return 'tensorflow_backend' - if be.startswith("openvino"): - return 'openvino_backend' - return '{}_backend'.format(be) + return "{}_backend".format(be) + +def backend_cmake_args(images, components, be, install_dir, library_paths): + cmake_build_type = FLAGS.build_type -def backend_cmake_args(images, components, be, install_dir, library_paths, - variant_index): - if be == 'onnxruntime': + if be == "onnxruntime": args = onnxruntime_cmake_args(images, library_paths) - elif be.startswith('openvino'): - args = openvino_cmake_args(be, variant_index) - elif be == 'tensorflow1': - args = tensorflow_cmake_args(1, images, library_paths) - elif be == 'tensorflow2': - args = tensorflow_cmake_args(2, images, library_paths) - elif be == 'python': + elif be == "openvino": + args = openvino_cmake_args() + elif be == "tensorflow": + args = tensorflow_cmake_args(images, library_paths) + elif be == "python": args = [] - elif be == 'dali': + elif be == "dali": args = dali_cmake_args() - elif be == 'pytorch': + elif be == "pytorch": args = pytorch_cmake_args(images) - elif be == 'armnn_tflite': + elif be == "armnn_tflite": args = armnn_tflite_cmake_args() - elif be == 'fil': + elif be == "fil": args = fil_cmake_args(images) - elif be == 'fastertransformer': - args = [] - elif be == 'tensorrt': + # DLIS-4618: FIL 
backend fails debug build, so override it for now. + cmake_build_type = "Release" + elif be == "fastertransformer": + args = fastertransformer_cmake_args() + elif be == "tensorrt": args = tensorrt_cmake_args() + elif be == "tensorrtllm": + args = tensorrtllm_cmake_args(images) else: args = [] cargs = args + [ - cmake_backend_arg(be, 'CMAKE_BUILD_TYPE', None, FLAGS.build_type), - cmake_backend_arg(be, 'CMAKE_INSTALL_PREFIX', 'PATH', install_dir), - cmake_backend_arg(be, 'TRITON_COMMON_REPO_TAG', 'STRING', - components['common']), - cmake_backend_arg(be, 'TRITON_CORE_REPO_TAG', 'STRING', - components['core']), - cmake_backend_arg(be, 'TRITON_BACKEND_REPO_TAG', 'STRING', - components['backend']) + cmake_backend_arg(be, "CMAKE_BUILD_TYPE", None, cmake_build_type), + cmake_backend_arg(be, "CMAKE_INSTALL_PREFIX", "PATH", install_dir), + cmake_backend_arg(be, "TRITON_COMMON_REPO_TAG", "STRING", components["common"]), + cmake_backend_arg(be, "TRITON_CORE_REPO_TAG", "STRING", components["core"]), + cmake_backend_arg( + be, "TRITON_BACKEND_REPO_TAG", "STRING", components["backend"] + ), ] - cargs.append(cmake_backend_enable(be, 'TRITON_ENABLE_GPU', - FLAGS.enable_gpu)) + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_GPU", FLAGS.enable_gpu)) cargs.append( - cmake_backend_enable(be, 'TRITON_ENABLE_MALI_GPU', - FLAGS.enable_mali_gpu)) - cargs.append( - cmake_backend_enable(be, 'TRITON_ENABLE_STATS', FLAGS.enable_stats)) + cmake_backend_enable(be, "TRITON_ENABLE_MALI_GPU", FLAGS.enable_mali_gpu) + ) + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_STATS", FLAGS.enable_stats)) cargs.append( - cmake_backend_enable(be, 'TRITON_ENABLE_METRICS', FLAGS.enable_metrics)) + cmake_backend_enable(be, "TRITON_ENABLE_METRICS", FLAGS.enable_metrics) + ) + + # [DLIS-4950] always enable below once Windows image is updated with CUPTI + # cargs.append(cmake_backend_enable(be, 'TRITON_ENABLE_MEMORY_TRACKER', True)) + if (target_platform() == "windows") and (not FLAGS.no_container_build): + print( + "Warning: Detected docker build is used for Windows, backend utility 'device memory tracker' will be disabled due to missing library in CUDA Windows docker image." + ) + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_MEMORY_TRACKER", False)) + elif target_platform() == "igpu": + print( + "Warning: Detected iGPU build, backend utility 'device memory tracker' will be disabled as iGPU doesn't contain required version of the library." 
+ ) + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_MEMORY_TRACKER", False)) + elif FLAGS.enable_gpu: + cargs.append(cmake_backend_enable(be, "TRITON_ENABLE_MEMORY_TRACKER", True)) cargs += cmake_backend_extra_args(be) - cargs.append('..') + cargs.append("..") return cargs def pytorch_cmake_args(images): - - # If platform is jetpack do not use docker based build - if target_platform() == 'jetpack': - if 'pytorch' not in library_paths: - raise Exception( - "Must specify library path for pytorch using --library-paths=pytorch:" - ) - pt_lib_path = library_paths['pytorch'] + "/lib" - pt_include_paths = "" - for suffix in [ - 'include/torch', 'include/torch/torch/csrc/api/include', - 'include/torchvision' - ]: - pt_include_paths += library_paths['pytorch'] + '/' + suffix + ';' - cargs = [ - cmake_backend_arg('pytorch', 'TRITON_PYTORCH_INCLUDE_PATHS', None, - pt_include_paths), - cmake_backend_arg('pytorch', 'TRITON_PYTORCH_LIB_PATHS', None, - pt_lib_path), - ] + if "pytorch" in images: + image = images["pytorch"] else: - if "pytorch" in images: - image = images["pytorch"] - else: - image = 'nvcr.io/nvidia/pytorch:{}-py3'.format( - FLAGS.upstream_container_version) - cargs = [ - cmake_backend_arg('pytorch', 'TRITON_PYTORCH_DOCKER_IMAGE', None, - image), - ] + image = "nvcr.io/nvidia/pytorch:{}-py3".format(FLAGS.upstream_container_version) + cargs = [ + cmake_backend_arg("pytorch", "TRITON_PYTORCH_DOCKER_IMAGE", None, image), + ] - if FLAGS.enable_gpu: - cargs.append( - cmake_backend_enable('pytorch', - 'TRITON_PYTORCH_ENABLE_TORCHTRT', True)) + if FLAGS.enable_gpu: cargs.append( - cmake_backend_enable('pytorch', 'TRITON_ENABLE_NVTX', - FLAGS.enable_nvtx)) + cmake_backend_enable("pytorch", "TRITON_PYTORCH_ENABLE_TORCHTRT", True) + ) + cargs.append( + cmake_backend_enable("pytorch", "TRITON_ENABLE_NVTX", FLAGS.enable_nvtx) + ) return cargs def onnxruntime_cmake_args(images, library_paths): cargs = [ - cmake_backend_arg('onnxruntime', 'TRITON_BUILD_ONNXRUNTIME_VERSION', - None, TRITON_VERSION_MAP[FLAGS.version][2]) + cmake_backend_arg( + "onnxruntime", + "TRITON_BUILD_ONNXRUNTIME_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][2], + ) ] # TRITON_ENABLE_GPU is already set for all backends in backend_cmake_args() if FLAGS.enable_gpu: cargs.append( - cmake_backend_enable('onnxruntime', - 'TRITON_ENABLE_ONNXRUNTIME_TENSORRT', True)) - - # If platform is jetpack do not use docker based build - if target_platform() == 'jetpack': - if 'onnxruntime' not in library_paths: - raise Exception( - "Must specify library path for onnxruntime using --library-paths=onnxruntime:" + cmake_backend_enable( + "onnxruntime", "TRITON_ENABLE_ONNXRUNTIME_TENSORRT", True + ) + ) + + if target_platform() == "windows": + if "base" in images: + cargs.append( + cmake_backend_arg( + "onnxruntime", "TRITON_BUILD_CONTAINER", None, images["base"] + ) ) - ort_lib_path = library_paths['onnxruntime'] + "/lib" - ort_include_path = library_paths['onnxruntime'] + "/include" - cargs += [ - cmake_backend_arg('onnxruntime', 'TRITON_ONNXRUNTIME_INCLUDE_PATHS', - None, ort_include_path), - cmake_backend_arg('onnxruntime', 'TRITON_ONNXRUNTIME_LIB_PATHS', - None, ort_lib_path), - cmake_backend_enable('onnxruntime', - 'TRITON_ENABLE_ONNXRUNTIME_OPENVINO', False) - ] else: - if target_platform() == 'windows': - if 'base' in images: - cargs.append( - cmake_backend_arg('onnxruntime', 'TRITON_BUILD_CONTAINER', - None, images['base'])) + if "base" in images: + cargs.append( + cmake_backend_arg( + "onnxruntime", "TRITON_BUILD_CONTAINER", 
None, images["base"] + ) + ) else: - if 'base' in images: - cargs.append( - cmake_backend_arg('onnxruntime', 'TRITON_BUILD_CONTAINER', - None, images['base'])) - else: - cargs.append( - cmake_backend_arg('onnxruntime', - 'TRITON_BUILD_CONTAINER_VERSION', None, - TRITON_VERSION_MAP[FLAGS.version][1])) - - if ((target_machine() != 'aarch64') and - (TRITON_VERSION_MAP[FLAGS.version][3] is not None)): - cargs.append( - cmake_backend_enable('onnxruntime', - 'TRITON_ENABLE_ONNXRUNTIME_OPENVINO', - True)) - cargs.append( - cmake_backend_arg( - 'onnxruntime', - 'TRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION', None, - TRITON_VERSION_MAP[FLAGS.version][3])) + cargs.append( + cmake_backend_arg( + "onnxruntime", + "TRITON_BUILD_CONTAINER_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][1], + ) + ) - return cargs + if (target_machine() != "aarch64") and ( + TRITON_VERSION_MAP[FLAGS.version][3] is not None + ): + cargs.append( + cmake_backend_enable( + "onnxruntime", "TRITON_ENABLE_ONNXRUNTIME_OPENVINO", True + ) + ) + cargs.append( + cmake_backend_arg( + "onnxruntime", + "TRITON_BUILD_ONNXRUNTIME_OPENVINO_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][3], + ) + ) + if target_platform() == "igpu": + cargs.append( + cmake_backend_arg( + "onnxruntime", + "TRITON_BUILD_TARGET_PLATFORM", + None, + target_platform(), + ) + ) -def openvino_cmake_args(be, variant_index): - using_specific_commit_sha = False - if TRITON_VERSION_MAP[FLAGS.version][4][variant_index][0] == 'SPECIFIC': - using_specific_commit_sha = True + return cargs - ov_version = TRITON_VERSION_MAP[FLAGS.version][4][variant_index][1] - if ov_version: - if using_specific_commit_sha: - use_prebuilt_ov = False - else: - use_prebuilt_ov = True - else: - # If the OV package version is None, then we are not using prebuilt package - ov_version = TRITON_VERSION_MAP[FLAGS.version][4][variant_index][0] - use_prebuilt_ov = False - if using_specific_commit_sha: - cargs = [ - cmake_backend_arg(be, 'TRITON_BUILD_OPENVINO_COMMIT_SHA', None, - ov_version), - ] - else: - cargs = [ - cmake_backend_arg(be, 'TRITON_BUILD_OPENVINO_VERSION', None, - ov_version), - ] - cargs.append( - cmake_backend_arg(be, 'TRITON_OPENVINO_BACKEND_INSTALLDIR', None, be)) - if target_platform() == 'windows': - if 'base' in images: + +def openvino_cmake_args(): + cargs = [ + cmake_backend_arg( + "openvino", + "TRITON_BUILD_OPENVINO_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][4], + ) + ] + if target_platform() == "windows": + if "base" in images: cargs.append( - cmake_backend_arg(be, 'TRITON_BUILD_CONTAINER', None, - images['base'])) + cmake_backend_arg( + "openvino", "TRITON_BUILD_CONTAINER", None, images["base"] + ) + ) else: - if 'base' in images: + if "base" in images: cargs.append( - cmake_backend_arg(be, 'TRITON_BUILD_CONTAINER', None, - images['base'])) + cmake_backend_arg( + "openvino", "TRITON_BUILD_CONTAINER", None, images["base"] + ) + ) else: cargs.append( - cmake_backend_arg(be, 'TRITON_BUILD_CONTAINER_VERSION', None, - TRITON_VERSION_MAP[FLAGS.version][1])) - cargs.append( - cmake_backend_enable(be, 'TRITON_BUILD_USE_PREBUILT_OPENVINO', - use_prebuilt_ov)) + cmake_backend_arg( + "openvino", + "TRITON_BUILD_CONTAINER_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][1], + ) + ) return cargs def tensorrt_cmake_args(): cargs = [ - cmake_backend_enable('tensorrt', 'TRITON_ENABLE_NVTX', - FLAGS.enable_nvtx), + cmake_backend_enable("tensorrt", "TRITON_ENABLE_NVTX", FLAGS.enable_nvtx), ] - if target_platform() == 'windows': + if target_platform() == "windows": 
cargs.append( - cmake_backend_arg('tensorrt', 'TRITON_TENSORRT_INCLUDE_PATHS', None, - 'c:/TensorRT/include')) + cmake_backend_arg( + "tensorrt", "TRITON_TENSORRT_INCLUDE_PATHS", None, "c:/TensorRT/include" + ) + ) return cargs -def tensorflow_cmake_args(ver, images, library_paths): - backend_name = "tensorflow{}".format(ver) - - # If platform is jetpack do not use docker images +def tensorflow_cmake_args(images, library_paths): + backend_name = "tensorflow" extra_args = [] - if target_platform() == 'jetpack': - if backend_name in library_paths: - extra_args = [ - cmake_backend_arg(backend_name, 'TRITON_TENSORFLOW_LIB_PATHS', - None, library_paths[backend_name]) - ] - else: - raise Exception( - f"Must specify library path for {backend_name} using --library-paths={backend_name}:" - ) + + # If a specific TF image is specified use it, otherwise pull from NGC. + if backend_name in images: + image = images[backend_name] else: - # If a specific TF image is specified use it, otherwise pull from NGC. - if backend_name in images: - image = images[backend_name] - else: - image = 'nvcr.io/nvidia/tensorflow:{}-tf{}-py3'.format( - FLAGS.upstream_container_version, ver) - extra_args = [ - cmake_backend_arg(backend_name, 'TRITON_TENSORFLOW_DOCKER_IMAGE', - None, image) - ] - return [ - cmake_backend_arg(backend_name, 'TRITON_TENSORFLOW_VERSION', None, ver) - ] + extra_args + image = "nvcr.io/nvidia/tensorflow:{}-tf2-py3".format( + FLAGS.upstream_container_version + ) + extra_args = [ + cmake_backend_arg(backend_name, "TRITON_TENSORFLOW_DOCKER_IMAGE", None, image) + ] + return extra_args def dali_cmake_args(): return [ - cmake_backend_enable('dali', 'TRITON_DALI_SKIP_DOWNLOAD', False), + cmake_backend_enable("dali", "TRITON_DALI_SKIP_DOWNLOAD", False), ] def fil_cmake_args(images): - cargs = [cmake_backend_enable('fil', 'TRITON_FIL_DOCKER_BUILD', True)] - if 'base' in images: + cargs = [cmake_backend_enable("fil", "TRITON_FIL_DOCKER_BUILD", True)] + if "base" in images: cargs.append( - cmake_backend_arg('fil', 'TRITON_BUILD_CONTAINER', None, - images['base'])) + cmake_backend_arg("fil", "TRITON_BUILD_CONTAINER", None, images["base"]) + ) else: cargs.append( - cmake_backend_arg('fil', 'TRITON_BUILD_CONTAINER_VERSION', None, - TRITON_VERSION_MAP[FLAGS.version][1])) + cmake_backend_arg( + "fil", + "TRITON_BUILD_CONTAINER_VERSION", + None, + TRITON_VERSION_MAP[FLAGS.version][1], + ) + ) return cargs def armnn_tflite_cmake_args(): return [ - cmake_backend_arg('armnn_tflite', 'JOBS', None, - multiprocessing.cpu_count()), + cmake_backend_arg("armnn_tflite", "JOBS", None, multiprocessing.cpu_count()), + ] + + +def fastertransformer_cmake_args(): + print("Warning: FasterTransformer backend is not officially supported.") + cargs = [ + cmake_backend_arg( + "fastertransformer", "CMAKE_EXPORT_COMPILE_COMMANDS", None, 1 + ), + cmake_backend_arg("fastertransformer", "ENABLE_FP8", None, "OFF"), ] + return cargs + + +def tensorrtllm_cmake_args(images): + cargs = [ + cmake_backend_arg( + "tensorrtllm", + "TRT_LIB_DIR", + None, + "${TRT_ROOT}/targets/${ARCH}-linux-gnu/lib", + ), + cmake_backend_arg( + "tensorrtllm", "TRT_INCLUDE_DIR", None, "${TRT_ROOT}/include" + ), + cmake_backend_arg( + "tensorrtllm", + "TRTLLM_BUILD_CONTAINER", + None, + images["base"], + ), + ] + cargs.append(cmake_backend_enable("tensorrtllm", "TRITON_BUILD", True)) + return cargs def install_dcgm_libraries(dcgm_version, target_machine): - if dcgm_version == '': + if dcgm_version == "": fail( - 'unable to determine default repo-tag, DCGM version not 
known for {}' - .format(FLAGS.version)) - return '' + "unable to determine default repo-tag, DCGM version not known for {}".format( + FLAGS.version + ) + ) + return "" else: - if target_machine == 'aarch64': - return ''' + if target_machine == "aarch64": + return """ ENV DCGM_VERSION {} # Install DCGM. Steps from https://developer.nvidia.com/dcgm#Downloads RUN curl -o /tmp/cuda-keyring.deb \ - https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/cuda-keyring_1.0-1_all.deb \ + https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa/cuda-keyring_1.0-1_all.deb \ && apt install /tmp/cuda-keyring.deb && rm /tmp/cuda-keyring.deb && \ apt-get update && apt-get install -y datacenter-gpu-manager=1:{} -'''.format(dcgm_version, dcgm_version) +""".format( + dcgm_version, dcgm_version + ) else: - return ''' + return """ ENV DCGM_VERSION {} # Install DCGM. Steps from https://developer.nvidia.com/dcgm#Downloads RUN curl -o /tmp/cuda-keyring.deb \ - https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb \ + https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb \ && apt install /tmp/cuda-keyring.deb && rm /tmp/cuda-keyring.deb && \ apt-get update && apt-get install -y datacenter-gpu-manager=1:{} -'''.format(dcgm_version, dcgm_version) +""".format( + dcgm_version, dcgm_version + ) def install_miniconda(conda_version, target_machine): - if conda_version == '': + if target_machine == "arm64": + # This branch handles the case where a Linux container is built on macOS with an ARM chip. + # macOS names the ARM architecture "arm64" while Linux calls it "aarch64", + # so map it to "aarch64" to find the right Miniconda build for Linux. + target_machine = "aarch64" + if conda_version == "": fail( - 'unable to determine default repo-tag, CONDA version not known for {}' - .format(FLAGS.version)) + "unable to determine default repo-tag, CONDA version not known for {}".format( + FLAGS.version + ) + ) miniconda_url = f"https://repo.anaconda.com/miniconda/Miniconda3-{conda_version}-Linux-{target_machine}.sh" - if target_machine == 'x86_64': - sha_sum = "3190da6626f86eee8abf1b2fd7a5af492994eb2667357ee4243975cdbb175d7a" + if target_machine == "x86_64": + sha_sum = "32d73e1bc33fda089d7cd9ef4c1be542616bd8e437d1f77afeeaf7afdb019787" else: - sha_sum = "0c20f121dc4c8010032d64f8e9b27d79e52d28355eb8d7972eafc90652387777" - return f''' + sha_sum = "80d6c306b015e1e3b01ea59dc66c676a81fa30279bc2da1f180a7ef7b2191d6e" + return f""" RUN mkdir -p /opt/ RUN wget "{miniconda_url}" -O miniconda.sh -q && \ echo "{sha_sum}" "miniconda.sh" > shasum && \ @@ -863,52 +894,68 @@ def install_miniconda(conda_version, target_machine): find /opt/conda/ -follow -type f -name '*.js.map' -delete && \ /opt/conda/bin/conda clean -afy ENV PATH /opt/conda/bin:${{PATH}} -''' +""" def create_dockerfile_buildbase(ddir, dockerfile_name, argmap): - df = ''' + df = """ ARG TRITON_VERSION={} ARG TRITON_CONTAINER_VERSION={} ARG BASE_IMAGE={} -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - argmap['BASE_IMAGE']) +""".format( + argmap["TRITON_VERSION"], + argmap["TRITON_CONTAINER_VERSION"], + argmap["BASE_IMAGE"], + ) - df += ''' + df += """ FROM ${BASE_IMAGE} ARG TRITON_VERSION ARG TRITON_CONTAINER_VERSION -''' +""" # Install the windows- or linux-specific buildbase dependencies - if target_platform() == 'windows': - df += ''' + if target_platform() == "windows": + df += """ SHELL ["cmd", "/S", "/C"] -''' +""" else: -
df += ''' + df += """ # Ensure apt-get won't prompt for selecting options ENV DEBIAN_FRONTEND=noninteractive +# Install docker docker buildx +RUN apt-get update \ + && apt-get install -y ca-certificates curl gnupg \ + && install -m 0755 -d /etc/apt/keyrings \ + && curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg \ + && chmod a+r /etc/apt/keyrings/docker.gpg \ + && echo \ + "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \ + "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \ + tee /etc/apt/sources.list.d/docker.list > /dev/null \ + && apt-get update \ + && apt-get install -y docker.io docker-buildx-plugin + # libcurl4-openSSL-dev is needed for GCS # python3-dev is needed by Torchvision # python3-pip and libarchive-dev is needed by python backend -# uuid-dev and pkg-config is needed for Azure Storage +# libxml2-dev is needed for Azure Storage # scons is needed for armnn_tflite backend build dep -RUN apt-get update && \ - apt-get install -y --no-install-recommends \ +RUN apt-get update \ + && apt-get install -y --no-install-recommends \ ca-certificates \ autoconf \ automake \ build-essential \ - docker.io \ git \ + gperf \ libre2-dev \ libssl-dev \ libtool \ - libboost-dev \ libcurl4-openssl-dev \ libb64-dev \ + libgoogle-perftools-dev \ patchelf \ python3-dev \ python3-pip \ @@ -916,70 +963,81 @@ def create_dockerfile_buildbase(ddir, dockerfile_name, argmap): rapidjson-dev \ scons \ software-properties-common \ + pkg-config \ unzip \ wget \ zlib1g-dev \ libarchive-dev \ - pkg-config \ - uuid-dev \ - libnuma-dev && \ - rm -rf /var/lib/apt/lists/* + libxml2-dev \ + libnuma-dev \ + wget \ + && rm -rf /var/lib/apt/lists/* RUN pip3 install --upgrade pip && \ pip3 install --upgrade wheel setuptools docker +# Install boost version >= 1.78 for boost::span +# Current libboost-dev apt packages are < 1.78, so install from tar.gz +RUN wget -O /tmp/boost.tar.gz \ + https://boostorg.jfrog.io/artifactory/main/release/1.80.0/source/boost_1_80_0.tar.gz && \ + (cd /tmp && tar xzf boost.tar.gz) && \ + cd /tmp/boost_1_80_0 && ./bootstrap.sh --prefix=/usr && ./b2 install && \ + mv /tmp/boost_1_80_0/boost /usr/include/boost + # Server build requires recent version of CMake (FetchContent required) -RUN wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ - apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ - apt-get update && \ - apt-get install -y --no-install-recommends \ - cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1 -''' +RUN apt update -q=2 \\ + && apt install -y gpg wget \\ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \\ + && . 
/etc/os-release \\ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \\ + && apt-get update -q=2 \\ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* +""" if FLAGS.enable_gpu: - df += install_dcgm_libraries(argmap['DCGM_VERSION'], - target_machine()) + df += install_dcgm_libraries(argmap["DCGM_VERSION"], target_machine()) - df += ''' + df += """ ENV TRITON_SERVER_VERSION ${TRITON_VERSION} ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION} -''' +""" # Copy in the triton source. We remove existing contents first in # case the FROM container has something there already. - if target_platform() == 'windows': - df += ''' + if target_platform() == "windows": + df += """ WORKDIR /workspace RUN rmdir /S/Q * || exit 0 COPY . . -''' +""" else: - df += ''' + df += """ WORKDIR /workspace RUN rm -fr * COPY . . ENTRYPOINT [] -''' +""" # Install miniconda required for the DALI backend. - if target_platform() != 'windows': - df += install_miniconda(argmap['CONDA_VERSION'], target_machine()) + if target_platform() != "windows": + df += install_miniconda(argmap["CONDA_VERSION"], target_machine()) with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) def create_dockerfile_cibase(ddir, dockerfile_name, argmap): - df = ''' + df = """ ARG TRITON_VERSION={} ARG TRITON_CONTAINER_VERSION={} ARG BASE_IMAGE={} -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - argmap['BASE_IMAGE']) +""".format( + argmap["TRITON_VERSION"], + argmap["TRITON_CONTAINER_VERSION"], + argmap["BASE_IMAGE"], + ) - df += ''' + df += """ FROM ${BASE_IMAGE} ARG TRITON_VERSION @@ -991,80 +1049,84 @@ def create_dockerfile_cibase(ddir, dockerfile_name, argmap): ENV TRITON_SERVER_VERSION ${TRITON_VERSION} ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION} -''' +""" with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) -def create_dockerfile_linux(ddir, dockerfile_name, argmap, backends, repoagents, - endpoints): - df = ''' +def create_dockerfile_linux( + ddir, dockerfile_name, argmap, backends, repoagents, caches, endpoints +): + df = """ ARG TRITON_VERSION={} ARG TRITON_CONTAINER_VERSION={} ARG BASE_IMAGE={} -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - argmap['BASE_IMAGE']) +""".format( + argmap["TRITON_VERSION"], + argmap["TRITON_CONTAINER_VERSION"], + argmap["BASE_IMAGE"], + ) - # PyTorch, TensorFlow 1 and TensorFlow 2 backends need extra CUDA and other + # PyTorch and TensorFlow backends need extra CUDA and other # dependencies during runtime that are missing in the CPU-only base container. # These dependencies must be copied from the Triton Min image. 
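# For CPU-only builds the generated Dockerfile adds a "min" stage and copies
# the CUDA/NCCL libraries that PyTorch and TensorFlow still need out of it.
# A minimal sketch of that multi-stage pattern, assuming a placeholder image
# tag and library list rather than the exact values this script emits:
def cpu_only_min_stage(gpu_min_image, libs_arch="x86_64", libs=("libnccl.so.2",)):
    df = f"FROM {gpu_min_image} AS min_container\n"
    df += "FROM ubuntu:22.04\n"
    for lib in libs:
        df += (
            f"COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/{lib} "
            f"/usr/lib/{libs_arch}-linux-gnu/{lib}\n"
        )
    return df
# e.g. print(cpu_only_min_stage("nvcr.io/nvidia/tritonserver:23.10-py3-min"))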
- if not FLAGS.enable_gpu and (('pytorch' in backends) or - ('tensorflow1' in backends) or - ('tensorflow2' in backends)): - df += ''' + if not FLAGS.enable_gpu and (("pytorch" in backends) or ("tensorflow" in backends)): + df += """ ############################################################################ ## Triton Min image ############################################################################ FROM {} AS min_container -'''.format(argmap['GPU_BASE_IMAGE']) +""".format( + argmap["GPU_BASE_IMAGE"] + ) - df += ''' + df += """ ############################################################################ ## Production stage: Create container with just inference server executable ############################################################################ FROM ${BASE_IMAGE} -''' +""" - df += dockerfile_prepare_container_linux(argmap, backends, FLAGS.enable_gpu, - target_machine()) + df += dockerfile_prepare_container_linux( + argmap, backends, FLAGS.enable_gpu, target_machine() + ) - df += ''' + df += """ WORKDIR /opt COPY --chown=1000:1000 build/install tritonserver WORKDIR /opt/tritonserver COPY --chown=1000:1000 NVIDIA_Deep_Learning_Container_License.pdf . -''' +""" if not FLAGS.no_core_build: # Add feature labels for SageMaker endpoint - if 'sagemaker' in endpoints: - df += ''' + if "sagemaker" in endpoints: + df += """ LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true LABEL com.amazonaws.sagemaker.capabilities.multi-models=true COPY --chown=1000:1000 docker/sagemaker/serve /usr/bin/. -''' +""" # This is required since libcublasLt.so is not present during the build # stage of the PyTorch backend - if not FLAGS.enable_gpu and ('pytorch' in backends): - df += ''' -RUN patchelf --add-needed /usr/local/cuda/lib64/stubs/libcublasLt.so.11 backends/pytorch/libtorch_cuda.so -''' + if not FLAGS.enable_gpu and ("pytorch" in backends): + df += """ +RUN patchelf --add-needed /usr/local/cuda/lib64/stubs/libcublasLt.so.12 backends/pytorch/libtorch_cuda.so +""" with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) -def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, - target_machine): +def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, target_machine): gpu_enabled = 1 if enable_gpu else 0 # Common steps to produce docker images shared by build.py and compose.py. - # Sets enviroment variables, installs dependencies and adds entrypoint - df = ''' + # Sets environment variables, installs dependencies and adds entrypoint + df = """ ARG TRITON_VERSION ARG TRITON_CONTAINER_VERSION @@ -1073,24 +1135,36 @@ def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, LABEL com.nvidia.tritonserver.version="${TRITON_SERVER_VERSION}" ENV PATH /opt/tritonserver/bin:${PATH} -''' +# Remove once https://github.com/openucx/ucx/pull/9148 is available +# in the min container. 
+ENV UCX_MEM_EVENTS no +""" # TODO Remove once the ORT-OpenVINO "Exception while Reading network" is fixed - if 'onnxruntime' in backends: - df += ''' + if "onnxruntime" in backends: + df += """ ENV LD_LIBRARY_PATH /opt/tritonserver/backends/onnxruntime:${LD_LIBRARY_PATH} -''' +""" + + # Necessary for libtorch.so to find correct HPCX libraries + if "pytorch" in backends: + df += """ +ENV LD_LIBRARY_PATH /opt/hpcx/ucc/lib/:/opt/hpcx/ucx/lib/:${LD_LIBRARY_PATH} +""" backend_dependencies = "" # libgomp1 is needed by both onnxruntime and pytorch backends - if ('onnxruntime' in backends) or ('pytorch' in backends): + if ("onnxruntime" in backends) or ("pytorch" in backends): backend_dependencies = "libgomp1" # libgfortran5 is needed by pytorch backend on ARM - if ('pytorch' in backends) and (target_machine == 'aarch64'): + if ("pytorch" in backends) and (target_machine == "aarch64"): backend_dependencies += " libgfortran5" + # openssh-server is needed for fastertransformer + if "fastertransformer" in backends: + backend_dependencies += " openssh-server" - df += ''' + df += """ ENV TF_ADJUST_HUE_FUSED 1 ENV TF_ADJUST_SATURATION_FUSED 1 ENV TF_ENABLE_WINOGRAD_NONFUSED 1 @@ -1113,76 +1187,68 @@ def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, # Common dependencies. FIXME (can any of these be conditional? For # example libcurl only needed for GCS?) -RUN apt-get update && \ - apt-get install -y --no-install-recommends \ - software-properties-common \ +RUN apt-get update \ + && apt-get install -y --no-install-recommends \ + clang \ + curl \ + dirmngr \ + git \ + gperf \ libb64-0d \ libcurl4-openssl-dev \ - libre2-5 \ - git \ - dirmngr \ + libgoogle-perftools-dev \ + libjemalloc-dev \ libnuma-dev \ - curl \ - {backend_dependencies} && \ - rm -rf /var/lib/apt/lists/* -'''.format(gpu_enabled=gpu_enabled, backend_dependencies=backend_dependencies) + libre2-9 \ + software-properties-common \ + wget \ + {backend_dependencies} \ + && rm -rf /var/lib/apt/lists/* + +# Install boost version >= 1.78 for boost::span +# Current libboost-dev apt packages are < 1.78, so install from tar.gz +RUN wget -O /tmp/boost.tar.gz \ + https://boostorg.jfrog.io/artifactory/main/release/1.80.0/source/boost_1_80_0.tar.gz \ + && (cd /tmp && tar xzf boost.tar.gz) \ + && cd /tmp/boost_1_80_0 \ + && ./bootstrap.sh --prefix=/usr \ + && ./b2 install \ + && rm -rf /tmp/boost* + +# Set TCMALLOC_RELEASE_RATE for users setting LD_PRELOAD with tcmalloc +ENV TCMALLOC_RELEASE_RATE 200 +""".format( + gpu_enabled=gpu_enabled, backend_dependencies=backend_dependencies + ) + + if "fastertransformer" in backends: + be = "fastertransformer" + url = "https://raw.githubusercontent.com/triton-inference-server/fastertransformer_backend/{}/docker/create_dockerfile_and_build.py".format( + backends[be] + ) + response = requests.get(url) + spec = importlib.util.spec_from_loader( + "fastertransformer_buildscript", loader=None, origin=url + ) + fastertransformer_buildscript = importlib.util.module_from_spec(spec) + exec(response.content, fastertransformer_buildscript.__dict__) + df += fastertransformer_buildscript.create_postbuild(is_multistage_build=False) if enable_gpu: - df += install_dcgm_libraries(argmap['DCGM_VERSION'], target_machine) - df += ''' + df += install_dcgm_libraries(argmap["DCGM_VERSION"], target_machine) + df += """ # Extra defensive wiring for CUDA Compat lib RUN ln -sf ${_CUDA_COMPAT_PATH}/lib.real ${_CUDA_COMPAT_PATH}/lib \ && echo ${_CUDA_COMPAT_PATH}/lib > /etc/ld.so.conf.d/00-cuda-compat.conf \ && ldconfig \ && 
rm -f ${_CUDA_COMPAT_PATH}/lib -''' - +""" else: - libs_arch = 'aarch64' if target_machine == 'aarch64' else 'x86_64' - if ('pytorch' in backends) or ('tensorflow1' in backends): - # Add extra dependencies for tensorflow1/pytorch backend. - # Note: Even though the build is CPU-only, the version of tensorflow1/ - # pytorch we are using depend upon libraries like cuda and cudnn. Since - # these dependencies are not present in the ubuntu base image, - # we must copy these from the Triton min container ourselves. - cuda_arch = 'sbsa' if target_machine == 'aarch64' else 'x86_64' - df += ''' -RUN mkdir -p /usr/local/cuda/lib64/stubs -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcusparse.so /usr/local/cuda/lib64/stubs/libcusparse.so.11 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcusolver.so /usr/local/cuda/lib64/stubs/libcusolver.so.11 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcurand.so /usr/local/cuda/lib64/stubs/libcurand.so.10 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcufft.so /usr/local/cuda/lib64/stubs/libcufft.so.10 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublas.so /usr/local/cuda/lib64/stubs/libcublas.so.11 -COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublasLt.so /usr/local/cuda/lib64/stubs/libcublasLt.so.11 - -RUN mkdir -p /usr/local/cuda/targets/{cuda_arch}-linux/lib -COPY --from=min_container /usr/local/cuda-11.7/targets/{cuda_arch}-linux/lib/libcudart.so.11.0 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. -COPY --from=min_container /usr/local/cuda-11.7/targets/{cuda_arch}-linux/lib/libcupti.so.11.7 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. -COPY --from=min_container /usr/local/cuda-11.7/targets/{cuda_arch}-linux/lib/libnvToolsExt.so.1 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. - -COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8 /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8 - -# patchelf is needed to add deps of libcublasLt.so.11 to libtorch_cuda.so -RUN apt-get update && \ - apt-get install -y --no-install-recommends openmpi-bin patchelf - -ENV LD_LIBRARY_PATH /usr/local/cuda/targets/{cuda_arch}-linux/lib:/usr/local/cuda/lib64/stubs:${{LD_LIBRARY_PATH}} -'''.format(cuda_arch=cuda_arch, libs_arch=libs_arch) - - if ('pytorch' in backends) or ('tensorflow1' in backends) \ - or ('tensorflow2' in backends): - # Add NCCL dependency for tensorflow1/tensorflow2/pytorch backend. - # Note: Even though the build is CPU-only, the version of tensorflow1/ - # tensorflow2/pytorch we are using depends upon the NCCL library. Since - # this dependency is not present in the ubuntu base image, we must - # copy it from the Triton min container ourselves. 
- df += ''' -COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/libnccl.so.2 /usr/lib/{libs_arch}-linux-gnu/libnccl.so.2 -'''.format(libs_arch=libs_arch) + df += add_cpu_libs_to_linux_dockerfile(backends, target_machine) # Add dependencies needed for python backend - if 'python' in backends: - df += ''' + if "python" in backends: + df += """ # python3, python3-pip and some pip installs required for the python backend RUN apt-get update && \ apt-get install -y --no-install-recommends \ @@ -1193,37 +1259,122 @@ def dockerfile_prepare_container_linux(argmap, backends, enable_gpu, pip3 install --upgrade wheel setuptools && \ pip3 install --upgrade numpy && \ rm -rf /var/lib/apt/lists/* -''' +""" + # Add dependencies needed for tensorrtllm backend + if "tensorrtllm" in backends: + be = "tensorrtllm" + url = "https://raw.githubusercontent.com/triton-inference-server/tensorrtllm_backend/{}/tools/gen_trtllm_dockerfile.py".format( + backends[be] + ) + + response = requests.get(url) + spec = importlib.util.spec_from_loader( + "trtllm_buildscript", loader=None, origin=url + ) + trtllm_buildscript = importlib.util.module_from_spec(spec) + exec(response.content, trtllm_buildscript.__dict__) + df += trtllm_buildscript.create_postbuild(backends[be]) + + if "vllm" in backends: + # [DLIS-5606] Build Conda environment for vLLM backend + # Remove Pip install once vLLM backend moves to Conda environment. + df += """ +# vLLM needed for vLLM backend +RUN pip3 install vllm=={} +""".format( + TRITON_VERSION_MAP[FLAGS.version][7] + ) - df += ''' + df += """ WORKDIR /opt/tritonserver RUN rm -fr /opt/tritonserver/* ENV NVIDIA_PRODUCT_NAME="Triton Server" COPY docker/entrypoint.d/ /opt/nvidia/entrypoint.d/ -''' +""" # The CPU-only build uses ubuntu as the base image, and so the # entrypoint files are not available in /opt/nvidia in the base # image, so we must provide them ourselves. if not enable_gpu: - df += ''' + df += """ COPY docker/cpu_only/ /opt/nvidia/ ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.sh"] -''' +""" - df += ''' + df += """ ENV NVIDIA_BUILD_ID {} LABEL com.nvidia.build.id={} LABEL com.nvidia.build.ref={} -'''.format(argmap['NVIDIA_BUILD_ID'], argmap['NVIDIA_BUILD_ID'], - argmap['NVIDIA_BUILD_REF']) +""".format( + argmap["NVIDIA_BUILD_ID"], argmap["NVIDIA_BUILD_ID"], argmap["NVIDIA_BUILD_REF"] + ) + + return df + + +def add_cpu_libs_to_linux_dockerfile(backends, target_machine): + df = "" + libs_arch = "aarch64" if target_machine == "aarch64" else "x86_64" + if "pytorch" in backends: + # Add extra dependencies for pytorch backend. + # Note: Even though the build is CPU-only, the version of pytorch + # we are using depend upon libraries like cuda and cudnn. Since + # these dependencies are not present in the ubuntu base image, + # we must copy these from the Triton min container ourselves. 
+ cuda_arch = "sbsa" if target_machine == "aarch64" else "x86_64" + df += """ +RUN mkdir -p /usr/local/cuda/lib64/stubs +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcusparse.so /usr/local/cuda/lib64/stubs/libcusparse.so.12 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcusolver.so /usr/local/cuda/lib64/stubs/libcusolver.so.11 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcurand.so /usr/local/cuda/lib64/stubs/libcurand.so.10 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcufft.so /usr/local/cuda/lib64/stubs/libcufft.so.11 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublas.so /usr/local/cuda/lib64/stubs/libcublas.so.12 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublasLt.so /usr/local/cuda/lib64/stubs/libcublasLt.so.12 +COPY --from=min_container /usr/local/cuda/lib64/stubs/libcublasLt.so /usr/local/cuda/lib64/stubs/libcublasLt.so.11 + +RUN mkdir -p /usr/local/cuda/targets/{cuda_arch}-linux/lib +COPY --from=min_container /usr/local/cuda/lib64/libcudart.so.12 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. +COPY --from=min_container /usr/local/cuda/lib64/libcupti.so.12 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. +COPY --from=min_container /usr/local/cuda/lib64/libnvToolsExt.so.1 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. +COPY --from=min_container /usr/local/cuda/lib64/libnvJitLink.so.12 /usr/local/cuda/targets/{cuda_arch}-linux/lib/. + +RUN mkdir -p /opt/hpcx/ucc/lib/ /opt/hpcx/ucx/lib/ +COPY --from=min_container /opt/hpcx/ucc/lib/libucc.so.1 /opt/hpcx/ucc/lib/libucc.so.1 +COPY --from=min_container /opt/hpcx/ucx/lib/libucm.so.0 /opt/hpcx/ucx/lib/libucm.so.0 +COPY --from=min_container /opt/hpcx/ucx/lib/libucp.so.0 /opt/hpcx/ucx/lib/libucp.so.0 +COPY --from=min_container /opt/hpcx/ucx/lib/libucs.so.0 /opt/hpcx/ucx/lib/libucs.so.0 +COPY --from=min_container /opt/hpcx/ucx/lib/libuct.so.0 /opt/hpcx/ucx/lib/libuct.so.0 + +COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8 /usr/lib/{libs_arch}-linux-gnu/libcudnn.so.8 + +# patchelf is needed to add deps of libcublasLt.so.12 to libtorch_cuda.so +RUN apt-get update && \ + apt-get install -y --no-install-recommends openmpi-bin patchelf + +ENV LD_LIBRARY_PATH /usr/local/cuda/targets/{cuda_arch}-linux/lib:/usr/local/cuda/lib64/stubs:${{LD_LIBRARY_PATH}} +""".format( + cuda_arch=cuda_arch, libs_arch=libs_arch + ) + + if ("pytorch" in backends) or ("tensorflow" in backends): + # Add NCCL dependency for tensorflow/pytorch backend. + # Note: Even though the build is CPU-only, the version of + # tensorflow/pytorch we are using depends upon the NCCL library. + # Since this dependency is not present in the ubuntu base image, + # we must copy it from the Triton min container ourselves. 
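# The templates above mix Python str.format() placeholders ({cuda_arch},
# {libs_arch}) with Dockerfile variable references, so literal ${...} is
# written as ${{...}} to survive formatting. A small self-contained
# illustration of that escaping:
template = (
    "ENV LD_LIBRARY_PATH /usr/local/cuda/targets/{cuda_arch}-linux/lib:"
    "/usr/local/cuda/lib64/stubs:${{LD_LIBRARY_PATH}}"
)
print(template.format(cuda_arch="x86_64"))
# -> ENV LD_LIBRARY_PATH /usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH}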
+ df += """ +COPY --from=min_container /usr/lib/{libs_arch}-linux-gnu/libnccl.so.2 /usr/lib/{libs_arch}-linux-gnu/libnccl.so.2 +""".format( + libs_arch=libs_arch + ) return df -def create_dockerfile_windows(ddir, dockerfile_name, argmap, backends, - repoagents): - df = ''' +def create_dockerfile_windows( + ddir, dockerfile_name, argmap, backends, repoagents, caches +): + df = """ ARG TRITON_VERSION={} ARG TRITON_CONTAINER_VERSION={} ARG BASE_IMAGE={} @@ -1242,9 +1393,12 @@ def create_dockerfile_windows(ddir, dockerfile_name, argmap, backends, RUN setx path "%path%;C:\opt\tritonserver\bin" -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - argmap['BASE_IMAGE']) - df += ''' +""".format( + argmap["TRITON_VERSION"], + argmap["TRITON_CONTAINER_VERSION"], + argmap["BASE_IMAGE"], + ) + df += """ WORKDIR /opt RUN rmdir /S/Q tritonserver || exit 0 COPY --chown=1000:1000 build/install tritonserver @@ -1252,118 +1406,136 @@ def create_dockerfile_windows(ddir, dockerfile_name, argmap, backends, WORKDIR /opt/tritonserver COPY --chown=1000:1000 NVIDIA_Deep_Learning_Container_License.pdf . -''' - df += ''' +""" + df += """ ENTRYPOINT [] ENV NVIDIA_BUILD_ID {} LABEL com.nvidia.build.id={} LABEL com.nvidia.build.ref={} -'''.format(argmap['NVIDIA_BUILD_ID'], argmap['NVIDIA_BUILD_ID'], - argmap['NVIDIA_BUILD_REF']) +""".format( + argmap["NVIDIA_BUILD_ID"], argmap["NVIDIA_BUILD_ID"], argmap["NVIDIA_BUILD_REF"] + ) with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) -def create_build_dockerfiles(container_build_dir, images, backends, repoagents, - endpoints): - if 'base' in images: - base_image = images['base'] - elif target_platform() == 'windows': - base_image = 'mcr.microsoft.com/dotnet/framework/sdk:4.8' +def create_build_dockerfiles( + container_build_dir, images, backends, repoagents, caches, endpoints +): + if "base" in images: + base_image = images["base"] + elif target_platform() == "windows": + base_image = "mcr.microsoft.com/dotnet/framework/sdk:4.8" elif FLAGS.enable_gpu: - base_image = 'nvcr.io/nvidia/tritonserver:{}-py3-min'.format( - FLAGS.upstream_container_version) + base_image = "nvcr.io/nvidia/tritonserver:{}-py3-min".format( + FLAGS.upstream_container_version + ) else: - base_image = 'ubuntu:20.04' + base_image = "ubuntu:22.04" dockerfileargmap = { - 'NVIDIA_BUILD_REF': - '' if FLAGS.build_sha is None else FLAGS.build_sha, - 'NVIDIA_BUILD_ID': - '' if FLAGS.build_id is None else FLAGS.build_id, - 'TRITON_VERSION': - FLAGS.version, - 'TRITON_CONTAINER_VERSION': - FLAGS.container_version, - 'BASE_IMAGE': - base_image, - 'DCGM_VERSION': - '' if FLAGS.version is None or FLAGS.version - not in TRITON_VERSION_MAP else TRITON_VERSION_MAP[FLAGS.version][5], - 'CONDA_VERSION': - '' if FLAGS.version is None or FLAGS.version - not in TRITON_VERSION_MAP else TRITON_VERSION_MAP[FLAGS.version][6] + "NVIDIA_BUILD_REF": "" if FLAGS.build_sha is None else FLAGS.build_sha, + "NVIDIA_BUILD_ID": "" if FLAGS.build_id is None else FLAGS.build_id, + "TRITON_VERSION": FLAGS.version, + "TRITON_CONTAINER_VERSION": FLAGS.container_version, + "BASE_IMAGE": base_image, + "DCGM_VERSION": "" + if FLAGS.version is None or FLAGS.version not in TRITON_VERSION_MAP + else TRITON_VERSION_MAP[FLAGS.version][5], + "CONDA_VERSION": "" + if FLAGS.version is None or FLAGS.version not in TRITON_VERSION_MAP + else TRITON_VERSION_MAP[FLAGS.version][6], } # For CPU-only image we need to copy some cuda libraries and dependencies - # since we are using PyTorch, TensorFlow 1, TensorFlow 2 
containers that + # since we are using PyTorch and TensorFlow containers that # are not CPU-only. - if not FLAGS.enable_gpu and ( - ('pytorch' in backends) or ('tensorflow1' in backends) or - ('tensorflow2' in backends)) and (target_platform() != 'windows'): - if 'gpu-base' in images: - gpu_base_image = images['gpu-base'] + if ( + not FLAGS.enable_gpu + and (("pytorch" in backends) or ("tensorflow" in backends)) + and (target_platform() != "windows") + ): + if "gpu-base" in images: + gpu_base_image = images["gpu-base"] else: - gpu_base_image = 'nvcr.io/nvidia/tritonserver:{}-py3-min'.format( - FLAGS.upstream_container_version) - dockerfileargmap['GPU_BASE_IMAGE'] = gpu_base_image + gpu_base_image = "nvcr.io/nvidia/tritonserver:{}-py3-min".format( + FLAGS.upstream_container_version + ) + dockerfileargmap["GPU_BASE_IMAGE"] = gpu_base_image - create_dockerfile_buildbase(FLAGS.build_dir, 'Dockerfile.buildbase', - dockerfileargmap) + create_dockerfile_buildbase( + FLAGS.build_dir, "Dockerfile.buildbase", dockerfileargmap + ) - if target_platform() == 'windows': - create_dockerfile_windows(FLAGS.build_dir, 'Dockerfile', - dockerfileargmap, backends, repoagents) + if target_platform() == "windows": + create_dockerfile_windows( + FLAGS.build_dir, + "Dockerfile", + dockerfileargmap, + backends, + repoagents, + caches, + ) else: - create_dockerfile_linux(FLAGS.build_dir, 'Dockerfile', dockerfileargmap, - backends, repoagents, endpoints) + create_dockerfile_linux( + FLAGS.build_dir, + "Dockerfile", + dockerfileargmap, + backends, + repoagents, + caches, + endpoints, + ) # Dockerfile used for the creating the CI base image. - create_dockerfile_cibase(FLAGS.build_dir, 'Dockerfile.cibase', - dockerfileargmap) + create_dockerfile_cibase(FLAGS.build_dir, "Dockerfile.cibase", dockerfileargmap) -def create_docker_build_script(script_name, container_install_dir, - container_ci_dir): +def create_docker_build_script(script_name, container_install_dir, container_ci_dir): with BuildScript( - os.path.join(FLAGS.build_dir, script_name), - verbose=FLAGS.verbose, - desc=('Docker-based build script for Triton Inference Server' - )) as docker_script: - + os.path.join(FLAGS.build_dir, script_name), + verbose=FLAGS.verbose, + desc=("Docker-based build script for Triton Inference Server"), + ) as docker_script: # # Build base image... tritonserver_buildbase # docker_script.commentln(8) - docker_script.comment('Create Triton base build image') + docker_script.comment("Create Triton base build image") docker_script.comment( - 'This image contains all dependencies necessary to build Triton') + "This image contains all dependencies necessary to build Triton" + ) docker_script.comment() cachefrommap = [ - 'tritonserver_buildbase', 'tritonserver_buildbase_cache0', - 'tritonserver_buildbase_cache1' + "tritonserver_buildbase", + "tritonserver_buildbase_cache0", + "tritonserver_buildbase_cache1", ] baseargs = [ - 'docker', 'build', '-t', 'tritonserver_buildbase', '-f', - os.path.join(FLAGS.build_dir, 'Dockerfile.buildbase') + "docker", + "build", + "-t", + "tritonserver_buildbase", + "-f", + os.path.join(FLAGS.build_dir, "Dockerfile.buildbase"), ] if not FLAGS.no_container_pull: baseargs += [ - '--pull', + "--pull", ] # Windows docker runs in a VM and memory needs to be specified # explicitly (at least for some configurations of docker). 
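# Roughly the docker invocation the generated build script issues for the
# build base image; the Dockerfile path depends on --build-dir and is shown
# here as "build/" purely for illustration:
cache_from = [
    "tritonserver_buildbase",
    "tritonserver_buildbase_cache0",
    "tritonserver_buildbase_cache1",
]
cmd = ["docker", "build", "-t", "tritonserver_buildbase",
       "-f", "build/Dockerfile.buildbase", "--pull"]
cmd += [f"--cache-from={c}" for c in cache_from] + ["."]
print(" ".join(cmd))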
- if target_platform() == 'windows': + if target_platform() == "windows": if FLAGS.container_memory: - baseargs += ['--memory', FLAGS.container_memory] + baseargs += ["--memory", FLAGS.container_memory] - baseargs += ['--cache-from={}'.format(k) for k in cachefrommap] - baseargs += ['.'] + baseargs += ["--cache-from={}".format(k) for k in cachefrommap] + baseargs += ["."] docker_script.cwd(THIS_SCRIPT_DIR) docker_script.cmd(baseargs, check_exitcode=True) @@ -1373,10 +1545,9 @@ def create_docker_build_script(script_name, container_install_dir, # docker_script.blankln() docker_script.commentln(8) - docker_script.comment('Run build in tritonserver_buildbase container') - docker_script.comment( - 'Mount a directory into the container where the install') - docker_script.comment('artifacts will be placed.') + docker_script.comment("Run build in tritonserver_buildbase container") + docker_script.comment("Mount a directory into the container where the install") + docker_script.comment("artifacts will be placed.") docker_script.comment() # Don't use '-v' to communicate the built artifacts out of the @@ -1384,63 +1555,76 @@ def create_docker_build_script(script_name, container_install_dir, # Docker (i.e. docker-in-docker) and not just if run directly # from host. runargs = [ - 'docker', 'run', '-w', '/workspace/build', '--name', - 'tritonserver_builder' + "docker", + "run", + "-w", + "/workspace/build", + "--name", + "tritonserver_builder", ] if not FLAGS.no_container_interactive: - runargs += ['-it'] + runargs += ["-it"] - if target_platform() == 'windows': + if target_platform() == "windows": if FLAGS.container_memory: - runargs += ['--memory', FLAGS.container_memory] - runargs += [ - '-v', '\\\\.\pipe\docker_engine:\\\\.\pipe\docker_engine' - ] + runargs += ["--memory", FLAGS.container_memory] + runargs += ["-v", "\\\\.\pipe\docker_engine:\\\\.\pipe\docker_engine"] else: - runargs += ['-v', '/var/run/docker.sock:/var/run/docker.sock'] + runargs += ["-v", "/var/run/docker.sock:/var/run/docker.sock"] - runargs += ['tritonserver_buildbase'] + runargs += ["tritonserver_buildbase"] - if target_platform() == 'windows': - runargs += [ - 'powershell.exe', '-noexit', '-File', './cmake_build.ps1' - ] + if target_platform() == "windows": + runargs += ["powershell.exe", "-noexit", "-File", "./cmake_build.ps1"] else: - runargs += ['./cmake_build'] + runargs += ["./cmake_build"] # Remove existing tritonserver_builder container... - if target_platform() == 'windows': - docker_script.cmd(['docker', 'rm', 'tritonserver_builder']) + if target_platform() == "windows": + docker_script.cmd(["docker", "rm", "tritonserver_builder"]) else: docker_script._file.write( - 'if [ "$(docker ps -a | grep tritonserver_builder)" ]; then docker rm tritonserver_builder; fi\n' + 'if [ "$(docker ps -a | grep tritonserver_builder)" ]; then docker rm -f tritonserver_builder; fi\n' ) docker_script.cmd(runargs, check_exitcode=True) - docker_script.cmd([ - 'docker', 'cp', 'tritonserver_builder:/tmp/tritonbuild/install', - FLAGS.build_dir - ], - check_exitcode=True) - docker_script.cmd([ - 'docker', 'cp', 'tritonserver_builder:/tmp/tritonbuild/ci', - FLAGS.build_dir - ], - check_exitcode=True) + docker_script.cmd( + [ + "docker", + "cp", + "tritonserver_builder:/tmp/tritonbuild/install", + FLAGS.build_dir, + ], + check_exitcode=True, + ) + docker_script.cmd( + [ + "docker", + "cp", + "tritonserver_builder:/tmp/tritonbuild/ci", + FLAGS.build_dir, + ], + check_exitcode=True, + ) # # Final image... 
tritonserver # docker_script.blankln() docker_script.commentln(8) - docker_script.comment('Create final tritonserver image') + docker_script.comment("Create final tritonserver image") docker_script.comment() finalargs = [ - 'docker', 'build', '-t', 'tritonserver', '-f', - os.path.join(FLAGS.build_dir, 'Dockerfile'), '.' + "docker", + "build", + "-t", + "tritonserver", + "-f", + os.path.join(FLAGS.build_dir, "Dockerfile"), + ".", ] docker_script.cwd(THIS_SCRIPT_DIR) @@ -1451,266 +1635,413 @@ def create_docker_build_script(script_name, container_install_dir, # docker_script.blankln() docker_script.commentln(8) - docker_script.comment('Create CI base image') + docker_script.comment("Create CI base image") docker_script.comment() cibaseargs = [ - 'docker', 'build', '-t', 'tritonserver_cibase', '-f', - os.path.join(FLAGS.build_dir, 'Dockerfile.cibase'), '.' + "docker", + "build", + "-t", + "tritonserver_cibase", + "-f", + os.path.join(FLAGS.build_dir, "Dockerfile.cibase"), + ".", ] docker_script.cwd(THIS_SCRIPT_DIR) docker_script.cmd(cibaseargs, check_exitcode=True) -def core_build(cmake_script, repo_dir, cmake_dir, build_dir, install_dir, - components, backends): - repo_build_dir = os.path.join(build_dir, 'tritonserver', 'build') - repo_install_dir = os.path.join(build_dir, 'tritonserver', 'install') +def core_build( + cmake_script, repo_dir, cmake_dir, build_dir, install_dir, components, backends +): + repo_build_dir = os.path.join(build_dir, "tritonserver", "build") + repo_install_dir = os.path.join(build_dir, "tritonserver", "install") cmake_script.commentln(8) - cmake_script.comment('Triton core library and tritonserver executable') + cmake_script.comment("Triton core library and tritonserver executable") cmake_script.comment() cmake_script.mkdir(repo_build_dir) cmake_script.cwd(repo_build_dir) cmake_script.cmake( - core_cmake_args(components, backends, cmake_dir, repo_install_dir)) + core_cmake_args(components, backends, cmake_dir, repo_install_dir) + ) cmake_script.makeinstall() - if target_platform() == 'windows': - cmake_script.mkdir(os.path.join(install_dir, 'bin')) + if target_platform() == "windows": + cmake_script.mkdir(os.path.join(install_dir, "bin")) cmake_script.cp( - os.path.join(repo_install_dir, 'bin', 'tritonserver.exe'), - os.path.join(install_dir, 'bin')) + os.path.join(repo_install_dir, "bin", "tritonserver.exe"), + os.path.join(install_dir, "bin"), + ) cmake_script.cp( - os.path.join(repo_install_dir, 'bin', 'tritonserver.dll'), - os.path.join(install_dir, 'bin')) + os.path.join(repo_install_dir, "bin", "tritonserver.dll"), + os.path.join(install_dir, "bin"), + ) else: - cmake_script.mkdir(os.path.join(install_dir, 'bin')) - cmake_script.cp(os.path.join(repo_install_dir, 'bin', 'tritonserver'), - os.path.join(install_dir, 'bin')) - cmake_script.mkdir(os.path.join(install_dir, 'lib')) + cmake_script.mkdir(os.path.join(install_dir, "bin")) + cmake_script.cp( + os.path.join(repo_install_dir, "bin", "tritonserver"), + os.path.join(install_dir, "bin"), + ) + cmake_script.mkdir(os.path.join(install_dir, "lib")) cmake_script.cp( - os.path.join(repo_install_dir, 'lib', 'libtritonserver.so'), - os.path.join(install_dir, 'lib')) + os.path.join(repo_install_dir, "lib", "libtritonserver.so"), + os.path.join(install_dir, "lib"), + ) + # [FIXME] Placing the Triton server wheel file in 'python' for now, should + # have been upload to pip registry and be able to install directly + cmake_script.mkdir(os.path.join(install_dir, "python")) + cmake_script.cp( + 
os.path.join(repo_install_dir, "python", "tritonserver*.whl"), + os.path.join(install_dir, "python"), + ) - cmake_script.mkdir(os.path.join(install_dir, 'include', 'triton')) + cmake_script.mkdir(os.path.join(install_dir, "include", "triton")) cmake_script.cpdir( - os.path.join(repo_install_dir, 'include', 'triton', 'core'), - os.path.join(install_dir, 'include', 'triton', 'core')) + os.path.join(repo_install_dir, "include", "triton", "core"), + os.path.join(install_dir, "include", "triton", "core"), + ) - cmake_script.cp(os.path.join(repo_dir, 'LICENSE'), install_dir) - cmake_script.cp(os.path.join(repo_dir, 'TRITON_VERSION'), install_dir) + cmake_script.cp(os.path.join(repo_dir, "LICENSE"), install_dir) + cmake_script.cp(os.path.join(repo_dir, "TRITON_VERSION"), install_dir) # If requested, package the source code for all OSS used to build # For windows, Triton is not delivered as a container so skip for # windows platform. - if target_platform() != 'windows': - if (not FLAGS.no_container_build) and (not FLAGS.no_core_build) and ( - not FLAGS.no_container_source): - cmake_script.mkdir(os.path.join(install_dir, 'third-party-src')) + if target_platform() != "windows": + if ( + (not FLAGS.no_container_build) + and (not FLAGS.no_core_build) + and (not FLAGS.no_container_source) + ): + cmake_script.mkdir(os.path.join(install_dir, "third-party-src")) cmake_script.cwd(repo_build_dir) cmake_script.tar( - 'third-party-src', - os.path.join(install_dir, 'third-party-src', 'src.tar.gz')) + "third-party-src", + os.path.join(install_dir, "third-party-src", "src.tar.gz"), + ) cmake_script.cp( - os.path.join(repo_dir, 'docker', 'README.third-party-src'), - os.path.join(install_dir, 'third-party-src', 'README')) + os.path.join(repo_dir, "docker", "README.third-party-src"), + os.path.join(install_dir, "third-party-src", "README"), + ) cmake_script.comment() - cmake_script.comment('end Triton core library and tritonserver executable') + cmake_script.comment("end Triton core library and tritonserver executable") cmake_script.commentln(8) cmake_script.blankln() -def backend_build(be, - cmake_script, - tag, - build_dir, - install_dir, - github_organization, - images, - components, - library_paths, - variant_index=0): - repo_build_dir = os.path.join(build_dir, be, 'build') - repo_install_dir = os.path.join(build_dir, be, 'install') +def tensorrtllm_prebuild(cmake_script): + # Export the TRT_ROOT environment variable + cmake_script.cmd("export TRT_ROOT=/usr/local/tensorrt") + cmake_script.cmd("export ARCH=$(uname -m)") + + # FIXME: Update the file structure to the one Triton expects. This is a temporary fix + # to get the build working for r23.10. 
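# The fastertransformer and tensorrtllm integrations above download a helper
# script from the corresponding backend repository while build.py runs and
# execute it in-process. A stripped-down sketch of that pattern, with a
# placeholder URL:
import importlib.util
import requests

url = "https://example.com/helper_script.py"  # placeholder, not a real endpoint
response = requests.get(url)
spec = importlib.util.spec_from_loader("helper_script", loader=None, origin=url)
helper = importlib.util.module_from_spec(spec)
exec(response.content, helper.__dict__)  # functions defined in the script land on `helper`
# e.g. helper.create_postbuild(...) could then be appended to the Dockerfile text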
+ cmake_script.cmd("mv tensorrtllm/inflight_batcher_llm/src tensorrtllm") + cmake_script.cmd("mv tensorrtllm/inflight_batcher_llm/cmake tensorrtllm") + cmake_script.cmd("mv tensorrtllm/inflight_batcher_llm/CMakeLists.txt tensorrtllm") + + +def backend_build( + be, + cmake_script, + tag, + build_dir, + install_dir, + github_organization, + images, + components, + library_paths, +): + repo_build_dir = os.path.join(build_dir, be, "build") + repo_install_dir = os.path.join(build_dir, be, "install") cmake_script.commentln(8) - cmake_script.comment(f'\'{be}\' backend') - cmake_script.comment('Delete this section to remove backend from build') + cmake_script.comment(f"'{be}' backend") + cmake_script.comment("Delete this section to remove backend from build") cmake_script.comment() cmake_script.mkdir(build_dir) cmake_script.cwd(build_dir) cmake_script.gitclone(backend_repo(be), tag, be, github_organization) + if be == "tensorrtllm": + tensorrtllm_prebuild(cmake_script) + cmake_script.mkdir(repo_build_dir) cmake_script.cwd(repo_build_dir) cmake_script.cmake( - backend_cmake_args(images, components, be, repo_install_dir, - library_paths, variant_index)) + backend_cmake_args(images, components, be, repo_install_dir, library_paths) + ) cmake_script.makeinstall() - cmake_script.mkdir(os.path.join(install_dir, 'backends')) - cmake_script.rmdir(os.path.join(install_dir, 'backends', be)) - cmake_script.cpdir(os.path.join(repo_install_dir, 'backends', be), - os.path.join(install_dir, 'backends')) + cmake_script.mkdir(os.path.join(install_dir, "backends")) + cmake_script.rmdir(os.path.join(install_dir, "backends", be)) + + cmake_script.cpdir( + os.path.join(repo_install_dir, "backends", be), + os.path.join(install_dir, "backends"), + ) cmake_script.comment() - cmake_script.comment(f'end \'{be}\' backend') + cmake_script.comment(f"end '{be}' backend") cmake_script.commentln(8) cmake_script.blankln() -def repo_agent_build(ra, cmake_script, build_dir, install_dir, repoagent_repo, - repoagents): - repo_build_dir = os.path.join(build_dir, ra, 'build') - repo_install_dir = os.path.join(build_dir, ra, 'install') +def backend_clone( + be, + clone_script, + tag, + build_dir, + install_dir, + github_organization, +): + clone_script.commentln(8) + clone_script.comment(f"'{be}' backend") + clone_script.comment("Delete this section to remove backend from build") + clone_script.comment() + clone_script.mkdir(build_dir) + clone_script.cwd(build_dir) + clone_script.gitclone(backend_repo(be), tag, be, github_organization) + + repo_target_dir = os.path.join(install_dir, "backends") + clone_script.mkdir(repo_target_dir) + backend_dir = os.path.join(repo_target_dir, be) + clone_script.rmdir(backend_dir) + clone_script.mkdir(backend_dir) + + clone_script.cp( + os.path.join(build_dir, be, "src", "model.py"), + backend_dir, + ) + + clone_script.comment() + clone_script.comment(f"end '{be}' backend") + clone_script.commentln(8) + clone_script.blankln() + + +def repo_agent_build( + ra, cmake_script, build_dir, install_dir, repoagent_repo, repoagents +): + repo_build_dir = os.path.join(build_dir, ra, "build") + repo_install_dir = os.path.join(build_dir, ra, "install") cmake_script.commentln(8) - cmake_script.comment(f'\'{ra}\' repository agent') - cmake_script.comment( - 'Delete this section to remove repository agent from build') + cmake_script.comment(f"'{ra}' repository agent") + cmake_script.comment("Delete this section to remove repository agent from build") cmake_script.comment() cmake_script.mkdir(build_dir) 
cmake_script.cwd(build_dir) - cmake_script.gitclone(repoagent_repo(ra), repoagents[ra], ra, - FLAGS.github_organization) + cmake_script.gitclone( + repoagent_repo(ra), repoagents[ra], ra, FLAGS.github_organization + ) cmake_script.mkdir(repo_build_dir) cmake_script.cwd(repo_build_dir) - cmake_script.cmake( - repoagent_cmake_args(images, components, ra, repo_install_dir)) + cmake_script.cmake(repoagent_cmake_args(images, components, ra, repo_install_dir)) cmake_script.makeinstall() - cmake_script.mkdir(os.path.join(install_dir, 'repoagents')) - cmake_script.rmdir(os.path.join(install_dir, 'repoagents', ra)) - cmake_script.cpdir(os.path.join(repo_install_dir, 'repoagents', ra), - os.path.join(install_dir, 'repoagents')) + cmake_script.mkdir(os.path.join(install_dir, "repoagents")) + cmake_script.rmdir(os.path.join(install_dir, "repoagents", ra)) + cmake_script.cpdir( + os.path.join(repo_install_dir, "repoagents", ra), + os.path.join(install_dir, "repoagents"), + ) cmake_script.comment() - cmake_script.comment(f'end \'{ra}\' repository agent') + cmake_script.comment(f"end '{ra}' repository agent") cmake_script.commentln(8) cmake_script.blankln() -def cibase_build(cmake_script, repo_dir, cmake_dir, build_dir, install_dir, - ci_dir, backends): - repo_build_dir = os.path.join(build_dir, 'tritonserver', 'build') - repo_install_dir = os.path.join(build_dir, 'tritonserver', 'install') +def cache_build(cache, cmake_script, build_dir, install_dir, cache_repo, caches): + repo_build_dir = os.path.join(build_dir, cache, "build") + repo_install_dir = os.path.join(build_dir, cache, "install") cmake_script.commentln(8) - cmake_script.comment('Collect Triton CI artifacts') + cmake_script.comment(f"'{cache}' cache") + cmake_script.comment("Delete this section to remove cache from build") + cmake_script.comment() + cmake_script.mkdir(build_dir) + cmake_script.cwd(build_dir) + cmake_script.gitclone( + cache_repo(cache), caches[cache], cache, FLAGS.github_organization + ) + + cmake_script.mkdir(repo_build_dir) + cmake_script.cwd(repo_build_dir) + cmake_script.cmake(cache_cmake_args(images, components, cache, repo_install_dir)) + cmake_script.makeinstall() + + cmake_script.mkdir(os.path.join(install_dir, "caches")) + cmake_script.rmdir(os.path.join(install_dir, "caches", cache)) + cmake_script.cpdir( + os.path.join(repo_install_dir, "caches", cache), + os.path.join(install_dir, "caches"), + ) + cmake_script.comment() + cmake_script.comment(f"end '{cache}' cache") + cmake_script.commentln(8) + cmake_script.blankln() + + +def cibase_build( + cmake_script, repo_dir, cmake_dir, build_dir, install_dir, ci_dir, backends +): + repo_install_dir = os.path.join(build_dir, "tritonserver", "install") + + cmake_script.commentln(8) + cmake_script.comment("Collect Triton CI artifacts") cmake_script.comment() cmake_script.mkdir(ci_dir) # On windows we are not yet using a CI/QA docker image for # testing, so don't do anything... - if target_platform() == 'windows': + if target_platform() == "windows": return # The core build produces some artifacts that are needed for CI # testing, so include those in the install. 
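# The CI collection below wraps each optional copy in a platform-specific
# existence guard so missing backends do not fail the script. A reduced
# sketch of the emitted lines for one backend (paths and the copy command
# itself are simplified here):
def guarded_copy_lines(src, dest, windows=False):
    if windows:
        return [f"if (Test-Path -Path {src}) {{", f"  cp -r {src} {dest}", "}"]
    return [f"if [[ -e {src} ]]; then", f"  cp -r {src} {dest}", "fi"]

print("\n".join(guarded_copy_lines(
    "/tmp/tritonbuild/identity/install/backends/identity", "ci/backends")))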
- cmake_script.cpdir(os.path.join(repo_dir, 'qa'), ci_dir) - cmake_script.cpdir(os.path.join(repo_dir, 'deploy'), ci_dir) - cmake_script.mkdir(os.path.join(ci_dir, 'docs')) - cmake_script.cpdir(os.path.join(repo_dir, 'docs', 'examples'), - os.path.join(ci_dir, 'docs')) - cmake_script.mkdir(os.path.join(ci_dir, 'src', 'test')) - cmake_script.cpdir(os.path.join(repo_dir, 'src', 'test', 'models'), - os.path.join(ci_dir, 'src', 'test')) - cmake_script.cpdir(os.path.join(repo_install_dir, 'bin'), ci_dir) - cmake_script.mkdir(os.path.join(ci_dir, 'lib')) - cmake_script.cp( - os.path.join(repo_install_dir, 'lib', - 'libtritonrepoagent_relocation.so'), - os.path.join(ci_dir, 'lib')) + cmake_script.cpdir(os.path.join(repo_dir, "qa"), ci_dir) + cmake_script.cpdir(os.path.join(repo_dir, "deploy"), ci_dir) + cmake_script.mkdir(os.path.join(ci_dir, "docs")) + cmake_script.cpdir( + os.path.join(repo_dir, "docs", "examples"), os.path.join(ci_dir, "docs") + ) + cmake_script.mkdir(os.path.join(ci_dir, "src", "test")) + cmake_script.cpdir( + os.path.join(repo_dir, "src", "test", "models"), + os.path.join(ci_dir, "src", "test"), + ) + # Skip copying the artifacts in the bin, lib, and python as those directories will + # be missing when the core build is not enabled. + if not FLAGS.no_core_build: + cmake_script.cpdir(os.path.join(repo_install_dir, "bin"), ci_dir) + cmake_script.mkdir(os.path.join(ci_dir, "lib")) + cmake_script.cp( + os.path.join(repo_install_dir, "lib", "libtritonrepoagent_relocation.so"), + os.path.join(ci_dir, "lib"), + ) + cmake_script.cpdir(os.path.join(repo_install_dir, "python"), ci_dir) # Some of the backends are needed for CI testing - cmake_script.mkdir(os.path.join(ci_dir, 'backends')) - for be in ('identity', 'repeat', 'square'): - be_install_dir = os.path.join(build_dir, be, 'install', 'backends', be) - if target_platform() == 'windows': - cmake_script.cmd(f'if (Test-Path -Path {be_install_dir}) {{') + cmake_script.mkdir(os.path.join(ci_dir, "backends")) + for be in ("identity", "repeat", "square"): + be_install_dir = os.path.join(build_dir, be, "install", "backends", be) + if target_platform() == "windows": + cmake_script.cmd(f"if (Test-Path -Path {be_install_dir}) {{") else: - cmake_script.cmd(f'if [[ -e {be_install_dir} ]]; then') - cmake_script.cpdir(be_install_dir, os.path.join(ci_dir, 'backends')) - cmake_script.cmd('}' if target_platform() == 'windows' else 'fi') + cmake_script.cmd(f"if [[ -e {be_install_dir} ]]; then") + cmake_script.cpdir(be_install_dir, os.path.join(ci_dir, "backends")) + cmake_script.cmd("}" if target_platform() == "windows" else "fi") # Some of the unit-test built backends are needed for CI testing - cmake_script.mkdir( - os.path.join(ci_dir, 'tritonbuild', 'tritonserver', 'backends')) - for be in ('query', 'implicit_state', 'sequence', 'dyna_sequence', - 'distributed_addsub'): - be_install_dir = os.path.join(repo_install_dir, 'backends', be) - if target_platform() == 'windows': - cmake_script.cmd(f'if (Test-Path -Path {be_install_dir}) {{') + cmake_script.mkdir(os.path.join(ci_dir, "tritonbuild", "tritonserver", "backends")) + for be in ( + "query", + "implicit_state", + "sequence", + "dyna_sequence", + "distributed_addsub", + "iterative_sequence", + ): + be_install_dir = os.path.join(repo_install_dir, "backends", be) + if target_platform() == "windows": + cmake_script.cmd(f"if (Test-Path -Path {be_install_dir}) {{") else: - cmake_script.cmd(f'if [[ -e {be_install_dir} ]]; then') + cmake_script.cmd(f"if [[ -e {be_install_dir} ]]; then") 
cmake_script.cpdir( be_install_dir, - os.path.join(ci_dir, 'tritonbuild', 'tritonserver', 'backends')) - cmake_script.cmd('}' if target_platform() == 'windows' else 'fi') + os.path.join(ci_dir, "tritonbuild", "tritonserver", "backends"), + ) + cmake_script.cmd("}" if target_platform() == "windows" else "fi") # The onnxruntime_backend build produces some artifacts that # are needed for CI testing. - if 'onnxruntime' in backends: - ort_install_dir = os.path.join(build_dir, 'onnxruntime', 'install') - cmake_script.mkdir(os.path.join(ci_dir, 'qa', 'L0_custom_ops')) - cmake_script.cp( - os.path.join(ort_install_dir, 'test', 'libcustom_op_library.so'), - os.path.join(ci_dir, 'qa', 'L0_custom_ops')) - cmake_script.cp( - os.path.join(ort_install_dir, 'test', 'custom_op_test.onnx'), - os.path.join(ci_dir, 'qa', 'L0_custom_ops')) + if "onnxruntime" in backends: + ort_install_dir = os.path.join(build_dir, "onnxruntime", "install") + cmake_script.mkdir(os.path.join(ci_dir, "qa", "L0_custom_ops")) + if target_platform() != "igpu": + cmake_script.cp( + os.path.join(ort_install_dir, "test", "libcustom_op_library.so"), + os.path.join(ci_dir, "qa", "L0_custom_ops"), + ) + cmake_script.cp( + os.path.join(ort_install_dir, "test", "custom_op_test.onnx"), + os.path.join(ci_dir, "qa", "L0_custom_ops"), + ) + # [WIP] other way than wildcard? + backend_tests = os.path.join(build_dir, "onnxruntime", "test", "*") + cmake_script.cpdir(backend_tests, os.path.join(ci_dir, "qa")) # Need the build area for some backends so that they can be # rebuilt with specific options. - cmake_script.mkdir(os.path.join(ci_dir, 'tritonbuild')) - for be in ('identity', 'python'): + cmake_script.mkdir(os.path.join(ci_dir, "tritonbuild")) + for be in ("identity", "python"): if be in backends: - cmake_script.rmdir(os.path.join(build_dir, be, 'build')) - cmake_script.rmdir(os.path.join(build_dir, be, 'install')) - cmake_script.cpdir(os.path.join(build_dir, be), - os.path.join(ci_dir, 'tritonbuild')) + cmake_script.rmdir(os.path.join(build_dir, be, "build")) + cmake_script.rmdir(os.path.join(build_dir, be, "install")) + cmake_script.cpdir( + os.path.join(build_dir, be), os.path.join(ci_dir, "tritonbuild") + ) cmake_script.comment() - cmake_script.comment('end Triton CI artifacts') + cmake_script.comment("end Triton CI artifacts") cmake_script.commentln(8) cmake_script.blankln() def finalize_build(cmake_script, install_dir, ci_dir): - cmake_script.cmd(f'chmod -R a+rw {install_dir}') - cmake_script.cmd(f'chmod -R a+rw {ci_dir}') + cmake_script.cmd(f"chmod -R a+rw {install_dir}") + cmake_script.cmd(f"chmod -R a+rw {ci_dir}") def enable_all(): - if target_platform() != 'windows': + if target_platform() != "windows": all_backends = [ - 'ensemble', 'identity', 'square', 'repeat', 'tensorflow1', - 'tensorflow2', 'onnxruntime', 'python', 'dali', 'pytorch', - 'openvino', 'fil', 'tensorrt' + "ensemble", + "identity", + "square", + "repeat", + "tensorflow", + "onnxruntime", + "python", + "dali", + "pytorch", + "openvino", + "fil", + "tensorrt", ] - all_repoagents = ['checksum'] - all_filesystems = ['gcs', 's3', 'azure_storage'] - all_endpoints = ['http', 'grpc', 'sagemaker', 'vertex-ai'] + all_repoagents = ["checksum"] + all_caches = ["local", "redis"] + all_filesystems = ["gcs", "s3", "azure_storage"] + all_endpoints = ["http", "grpc", "sagemaker", "vertex-ai"] FLAGS.enable_logging = True FLAGS.enable_stats = True FLAGS.enable_metrics = True FLAGS.enable_gpu_metrics = True + FLAGS.enable_cpu_metrics = True FLAGS.enable_tracing = True 
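# The --backend, --repoagent and --cache flags handled below take
# "name[:tag]" items; enable_all() compares only the name part before topping
# up the defaults. A minimal illustration of that split-and-merge:
requested = ["onnxruntime:r23.10", "python"]
names = [item.split(":")[0] for item in requested]
for default in ("ensemble", "identity", "python"):
    if default not in names:
        requested.append(default)
print(requested)  # ['onnxruntime:r23.10', 'python', 'ensemble', 'identity']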
FLAGS.enable_nvtx = True FLAGS.enable_gpu = True else: all_backends = [ - 'ensemble', 'identity', 'square', 'repeat', 'onnxruntime', - 'openvino', 'tensorrt' + "ensemble", + "identity", + "square", + "repeat", + "onnxruntime", + "openvino", + "tensorrt", ] - all_repoagents = ['checksum'] + all_repoagents = ["checksum"] + all_caches = ["local", "redis"] all_filesystems = [] - all_endpoints = ['http', 'grpc'] + all_endpoints = ["http", "grpc"] FLAGS.enable_logging = True FLAGS.enable_stats = True @@ -1719,7 +2050,7 @@ def enable_all(): requested_backends = [] for be in FLAGS.backend: - parts = be.split(':') + parts = be.split(":") requested_backends += [parts[0]] for be in all_backends: if be not in requested_backends: @@ -1727,12 +2058,20 @@ def enable_all(): requested_repoagents = [] for ra in FLAGS.repoagent: - parts = ra.split(':') + parts = ra.split(":") requested_repoagents += [parts[0]] for ra in all_repoagents: if ra not in requested_repoagents: FLAGS.repoagent += [ra] + requested_caches = [] + for cache in FLAGS.cache: + parts = cache.split(":") + requested_caches += [parts[0]] + for cache in all_caches: + if cache not in requested_caches: + FLAGS.cache += [cache] + for fs in all_filesystems: if fs not in FLAGS.filesystem: FLAGS.filesystem += [fs] @@ -1742,294 +2081,296 @@ def enable_all(): FLAGS.endpoint += [ep] -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() group_qv = parser.add_mutually_exclusive_group() - group_qv.add_argument('-q', - '--quiet', - action="store_true", - required=False, - help='Disable console output.') - group_qv.add_argument('-v', - '--verbose', - action="store_true", - required=False, - help='Enable verbose output.') + group_qv.add_argument( + "-q", + "--quiet", + action="store_true", + required=False, + help="Disable console output.", + ) + group_qv.add_argument( + "-v", + "--verbose", + action="store_true", + required=False, + help="Enable verbose output.", + ) parser.add_argument( - '--dryrun', + "--dryrun", + action="store_true", + required=False, + help="Output the build scripts, but do not perform build.", + ) + parser.add_argument( + "--no-container-build", action="store_true", required=False, - help='Output the build scripts, but do not perform build.') - parser.add_argument('--no-container-build', - action="store_true", - required=False, - help='Do not use Docker container for build.') + help="Do not use Docker container for build.", + ) parser.add_argument( - '--no-container-interactive', + "--no-container-interactive", action="store_true", required=False, - help= - 'Do not use -it argument to "docker run" when performing container build.' + help='Do not use -it argument to "docker run" when performing container build.', ) parser.add_argument( - '--no-container-pull', + "--no-container-pull", action="store_true", required=False, - help='Do not use Docker --pull argument when building container.') + help="Do not use Docker --pull argument when building container.", + ) parser.add_argument( - '--container-memory', + "--container-memory", default=None, required=False, - help='Value for Docker --memory argument. Used only for windows builds.' + help="Value for Docker --memory argument. Used only for windows builds.", ) parser.add_argument( - '--target-platform', + "--target-platform", required=False, default=None, - help= - 'Target platform for build, can be "linux", "windows" or "jetpack". If not specified, build targets the current platform.' 
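As a hedged sketch of the merge logic in `enable_all()` above (the helper name and example values are illustrative, not from build.py): anything already requested on the command line is kept, and only the missing defaults are appended.

```
def merge_defaults(requested, defaults):
    # "name" or "name:tag" entries already requested take precedence;
    # a default is appended only when its name is not present yet.
    requested_names = [item.split(":")[0] for item in requested]
    return list(requested) + [d for d in defaults if d not in requested_names]

# e.g. merge_defaults(["redis:r23.06"], ["local", "redis"])
#      -> ["redis:r23.06", "local"]
```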
+ help='Target platform for build, can be "linux", "windows" or "igpu". If not specified, build targets the current platform.', ) parser.add_argument( - '--target-machine', + "--target-machine", required=False, default=None, - help= - 'Target machine/architecture for build. If not specified, build targets the current machine/architecture.' - ) - - parser.add_argument('--build-id', - type=str, - required=False, - help='Build ID associated with the build.') - parser.add_argument('--build-sha', - type=str, - required=False, - help='SHA associated with the build.') + help="Target machine/architecture for build. If not specified, build targets the current machine/architecture.", + ) + + parser.add_argument( + "--build-id", + type=str, + required=False, + help="Build ID associated with the build.", + ) + parser.add_argument( + "--build-sha", type=str, required=False, help="SHA associated with the build." + ) parser.add_argument( - '--build-dir', + "--build-dir", type=str, required=False, - help= - 'Build directory. All repo clones and builds will be performed in this directory.' + help="Build directory. All repo clones and builds will be performed in this directory.", ) parser.add_argument( - '--install-dir', + "--install-dir", type=str, required=False, default=None, - help='Install directory, default is /opt/tritonserver.') + help="Install directory, default is /opt/tritonserver.", + ) parser.add_argument( - '--cmake-dir', + "--cmake-dir", type=str, required=False, - help='Directory containing the CMakeLists.txt file for Triton server.') + help="Directory containing the CMakeLists.txt file for Triton server.", + ) parser.add_argument( - '--tmp-dir', + "--tmp-dir", type=str, required=False, - default='/tmp', - help= - 'Temporary directory used for building inside docker. Default is /tmp.') + default="/tmp", + help="Temporary directory used for building inside docker. Default is /tmp.", + ) parser.add_argument( - '--library-paths', - action='append', + "--library-paths", + action="append", required=False, default=None, - help= - 'Specify library paths for respective backends in build as [:].' + help="Specify library paths for respective backends in build as [:].", ) parser.add_argument( - '--build-type', + "--build-type", required=False, - default='Release', - help= - 'Build type, one of "Release", "Debug", "RelWithDebInfo" or "MinSizeRel". Default is "Release".' + default="Release", + help='Build type, one of "Release", "Debug", "RelWithDebInfo" or "MinSizeRel". Default is "Release".', ) parser.add_argument( - '-j', - '--build-parallel', + "-j", + "--build-parallel", type=int, required=False, default=None, - help='Build parallelism. Defaults to 2 * number-of-cores.') + help="Build parallelism. Defaults to 2 * number-of-cores.", + ) parser.add_argument( - '--github-organization', + "--github-organization", type=str, required=False, - default='https://github.com/triton-inference-server', - help= - 'The GitHub organization containing the repos used for the build. Defaults to "https://github.com/triton-inference-server".' + default="https://github.com/triton-inference-server", + help='The GitHub organization containing the repos used for the build. Defaults to "https://github.com/triton-inference-server".', ) parser.add_argument( - '--version', + "--version", type=str, required=False, - help= - 'The Triton version. If not specified defaults to the value in the TRITON_VERSION file.' + help="The Triton version. 
If not specified defaults to the value in the TRITON_VERSION file.", ) parser.add_argument( - '--container-version', + "--container-version", type=str, required=False, - help= - 'The Triton container version to build. If not specified the container version will be chosen automatically based on --version value.' + help="The Triton container version to build. If not specified the container version will be chosen automatically based on --version value.", ) parser.add_argument( - '--upstream-container-version', + "--upstream-container-version", type=str, required=False, - help= - 'The upstream container version to use for the build. If not specified the upstream container version will be chosen automatically based on --version value.' + help="The upstream container version to use for the build. If not specified the upstream container version will be chosen automatically based on --version value.", ) parser.add_argument( - '--container-prebuild-command', + "--container-prebuild-command", type=str, required=False, - help= - 'When performing a container build, this command will be executed within the container just before the build it performed.' + help="When performing a container build, this command will be executed within the container just before the build it performed.", ) parser.add_argument( - '--no-container-source', + "--no-container-source", action="store_true", required=False, - help='Do not include OSS source code in Docker container.') + help="Do not include OSS source code in Docker container.", + ) parser.add_argument( - '--image', - action='append', + "--image", + action="append", required=False, - help= - 'Use specified Docker image in build as ,. can be "base", "gpu-base", "tensorflow1", "tensorflow2", or "pytorch".' + help='Use specified Docker image in build as ,. can be "base", "gpu-base", "tensorflow", or "pytorch".', ) parser.add_argument( - '--enable-all', + "--enable-all", + action="store_true", + required=False, + help="Enable all standard released Triton features, backends, repository agents, caches, endpoints and file systems.", + ) + parser.add_argument( + "--enable-logging", action="store_true", required=False, help="Enable logging." + ) + parser.add_argument( + "--enable-stats", action="store_true", required=False, - help= - 'Enable all standard released Triton features, backends, repository agents, endpoints and file systems.' 
- ) - parser.add_argument('--enable-logging', - action="store_true", - required=False, - help='Enable logging.') - parser.add_argument('--enable-stats', - action="store_true", - required=False, - help='Enable statistics collection.') - parser.add_argument('--enable-metrics', - action="store_true", - required=False, - help='Enable metrics reporting.') - parser.add_argument('--enable-gpu-metrics', - action="store_true", - required=False, - help='Include GPU metrics in reported metrics.') - parser.add_argument('--enable-tracing', - action="store_true", - required=False, - help='Enable tracing.') - parser.add_argument('--enable-nvtx', - action="store_true", - required=False, - help='Enable NVTX.') - parser.add_argument('--enable-gpu', - action="store_true", - required=False, - help='Enable GPU support.') - parser.add_argument('--enable-mali-gpu', - action="store_true", - required=False, - help='Enable ARM MALI GPU support.') + help="Enable statistics collection.", + ) parser.add_argument( - '--min-compute-capability', + "--enable-metrics", + action="store_true", + required=False, + help="Enable metrics reporting.", + ) + parser.add_argument( + "--enable-gpu-metrics", + action="store_true", + required=False, + help="Include GPU metrics in reported metrics.", + ) + parser.add_argument( + "--enable-cpu-metrics", + action="store_true", + required=False, + help="Include CPU metrics in reported metrics.", + ) + parser.add_argument( + "--enable-tracing", action="store_true", required=False, help="Enable tracing." + ) + parser.add_argument( + "--enable-nvtx", action="store_true", required=False, help="Enable NVTX." + ) + parser.add_argument( + "--enable-gpu", action="store_true", required=False, help="Enable GPU support." + ) + parser.add_argument( + "--enable-mali-gpu", + action="store_true", + required=False, + help="Enable ARM MALI GPU support.", + ) + parser.add_argument( + "--min-compute-capability", type=str, required=False, - default='6.0', - help='Minimum CUDA compute capability supported by server.') + default="6.0", + help="Minimum CUDA compute capability supported by server.", + ) parser.add_argument( - '--endpoint', - action='append', + "--endpoint", + action="append", required=False, - help= - 'Include specified endpoint in build. Allowed values are "grpc", "http", "vertex-ai" and "sagemaker".' + help='Include specified endpoint in build. Allowed values are "grpc", "http", "vertex-ai" and "sagemaker".', ) parser.add_argument( - '--filesystem', - action='append', + "--filesystem", + action="append", required=False, - help= - 'Include specified filesystem in build. Allowed values are "gcs", "azure_storage" and "s3".' + help='Include specified filesystem in build. Allowed values are "gcs", "azure_storage" and "s3".', ) parser.add_argument( - '--no-core-build', + "--no-core-build", action="store_true", required=False, - help='Do not build Triton core sharead library or executable.') + help="Do not build Triton core shared library or executable.", + ) parser.add_argument( - '--backend', - action='append', + "--backend", + action="append", required=False, - help= - 'Include specified backend in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.05 -> branch r22.05); otherwise the default is "main" (e.g. version 22.05dev -> branch main).' 
+ help='Include specified backend in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version YY.MM -> branch rYY.MM); otherwise the default is "main" (e.g. version YY.MMdev -> branch main).', ) parser.add_argument( - '--build-multiple-openvino', - action="store_true", - default=False, - help= - 'Build multiple openVINO versions as specified in TRITON_VERSION_MAP. Be aware that loading backends with different openvino versions simultaneously in triton can cause conflicts' + "--repo-tag", + action="append", + required=False, + help='The version of a component to use in the build as :. can be "common", "core", "backend" or "thirdparty". indicates the git tag/branch to use for the build. Currently does not support pull-request reference. If the version is non-development then the default is the release branch matching the container version (e.g. version YY.MM -> branch rYY.MM); otherwise the default is "main" (e.g. version YY.MMdev -> branch main).', ) parser.add_argument( - '--repo-tag', - action='append', + "--repoagent", + action="append", required=False, - help= - 'The version of a component to use in the build as :. can be "common", "core", "backend" or "thirdparty". If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.05 -> branch r22.05); otherwise the default is "main" (e.g. version 22.05dev -> branch main).' + help='Include specified repo agent in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version YY.MM -> branch rYY.MM); otherwise the default is "main" (e.g. version YY.MMdev -> branch main).', ) parser.add_argument( - '--repoagent', - action='append', + "--cache", + action="append", required=False, - help= - 'Include specified repo agent in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version 22.05 -> branch r22.05); otherwise the default is "main" (e.g. version 22.05dev -> branch main).' + help='Include specified cache in build as [:]. If starts with "pull/" then it refers to a pull-request reference, otherwise indicates the git tag/branch to use for the build. If the version is non-development then the default is the release branch matching the container version (e.g. version YY.MM -> branch rYY.MM); otherwise the default is "main" (e.g. version YY.MMdev -> branch main).', ) parser.add_argument( - '--no-force-clone', + "--no-force-clone", action="store_true", default=False, - help='Do not create fresh clones of repos that have already been cloned.' + help="Do not create fresh clones of repos that have already been cloned.", ) parser.add_argument( - '--extra-core-cmake-arg', - action='append', + "--extra-core-cmake-arg", + action="append", required=False, - help= - 'Extra CMake argument as =. 
The argument is passed to CMake as -D= and is included after all CMake arguments added by build.py for the core builds.' + help="Extra CMake argument as =. The argument is passed to CMake as -D= and is included after all CMake arguments added by build.py for the core builds.", ) parser.add_argument( - '--override-core-cmake-arg', - action='append', + "--override-core-cmake-arg", + action="append", required=False, - help= - 'Override specified CMake argument in the build as =. The argument is passed to CMake as -D=. This flag only impacts CMake arguments that are used by build.py. To unconditionally add a CMake argument to the core build use --extra-core-cmake-arg.' + help="Override specified CMake argument in the build as =. The argument is passed to CMake as -D=. This flag only impacts CMake arguments that are used by build.py. To unconditionally add a CMake argument to the core build use --extra-core-cmake-arg.", ) parser.add_argument( - '--extra-backend-cmake-arg', - action='append', + "--extra-backend-cmake-arg", + action="append", required=False, - help= - 'Extra CMake argument for a backend build as :=. The argument is passed to CMake as -D= and is included after all CMake arguments added by build.py for the backend.' + help="Extra CMake argument for a backend build as :=. The argument is passed to CMake as -D= and is included after all CMake arguments added by build.py for the backend.", ) parser.add_argument( - '--override-backend-cmake-arg', - action='append', + "--override-backend-cmake-arg", + action="append", required=False, - help= - 'Override specified backend CMake argument in the build as :=. The argument is passed to CMake as -D=. This flag only impacts CMake arguments that are used by build.py. To unconditionally add a CMake argument to the backend build use --extra-backend-cmake-arg.' + help="Override specified backend CMake argument in the build as :=. The argument is passed to CMake as -D=. This flag only impacts CMake arguments that are used by build.py. To unconditionally add a CMake argument to the backend build use --extra-backend-cmake-arg.", ) FLAGS = parser.parse_args() @@ -2046,6 +2387,8 @@ def enable_all(): FLAGS.filesystem = [] if FLAGS.repoagent is None: FLAGS.repoagent = [] + if FLAGS.cache is None: + FLAGS.cache = [] if FLAGS.library_paths is None: FLAGS.library_paths = [] if FLAGS.extra_core_cmake_arg is None: @@ -2058,7 +2401,7 @@ def enable_all(): FLAGS.extra_backend_cmake_arg = [] # if --enable-all is specified, then update FLAGS to enable all - # settings, backends, repo-agents, file systems, endpoints, etc. + # settings, backends, repo-agents, caches, file systems, endpoints, etc. if FLAGS.enable_all: enable_all() @@ -2069,64 +2412,63 @@ def enable_all(): # set. 
if FLAGS.no_container_build: if FLAGS.build_dir is None: - fail('--no-container-build requires --build-dir') + fail("--no-container-build requires --build-dir") if FLAGS.install_dir is None: - FLAGS.install_dir = os.path.join(FLAGS.build_dir, "opt", - "tritonserver") + FLAGS.install_dir = os.path.join(FLAGS.build_dir, "opt", "tritonserver") if FLAGS.cmake_dir is None: FLAGS.cmake_dir = THIS_SCRIPT_DIR else: if FLAGS.build_dir is not None: - fail('--build-dir must not be set for container-based build') + fail("--build-dir must not be set for container-based build") if FLAGS.install_dir is not None: - fail('--install-dir must not be set for container-based build') + fail("--install-dir must not be set for container-based build") if FLAGS.cmake_dir is not None: - fail('--cmake-dir must not be set for container-based build') - FLAGS.build_dir = os.path.join(THIS_SCRIPT_DIR, 'build') + fail("--cmake-dir must not be set for container-based build") + FLAGS.build_dir = os.path.join(THIS_SCRIPT_DIR, "build") # Determine the versions. Start with Triton version, if --version # is not explicitly specified read from TRITON_VERSION file. if FLAGS.version is None: - with open(os.path.join(THIS_SCRIPT_DIR, 'TRITON_VERSION'), - "r") as vfile: + with open(os.path.join(THIS_SCRIPT_DIR, "TRITON_VERSION"), "r") as vfile: FLAGS.version = vfile.readline().strip() if FLAGS.build_parallel is None: FLAGS.build_parallel = multiprocessing.cpu_count() * 2 - log('Building Triton Inference Server') - log('platform {}'.format(target_platform())) - log('machine {}'.format(target_machine())) - log('version {}'.format(FLAGS.version)) - log('build dir {}'.format(FLAGS.build_dir)) - log('install dir {}'.format(FLAGS.install_dir)) - log('cmake dir {}'.format(FLAGS.cmake_dir)) + log("Building Triton Inference Server") + log("platform {}".format(target_platform())) + log("machine {}".format(target_machine())) + log("version {}".format(FLAGS.version)) + log("build dir {}".format(FLAGS.build_dir)) + log("install dir {}".format(FLAGS.install_dir)) + log("cmake dir {}".format(FLAGS.cmake_dir)) # Determine the default repo-tag that should be used for images, - # backends and repo-agents if a repo-tag is not given + # backends, repo-agents, and caches if a repo-tag is not given # explicitly. For release branches we use the release branch as # the default, otherwise we use 'main'. - default_repo_tag = 'main' + default_repo_tag = "main" cver = FLAGS.container_version if cver is None: if FLAGS.version not in TRITON_VERSION_MAP: fail( - 'unable to determine default repo-tag, container version not known for {}' - .format(FLAGS.version)) + "unable to determine default repo-tag, container version not known for {}".format( + FLAGS.version + ) + ) cver = TRITON_VERSION_MAP[FLAGS.version][0] - if not cver.endswith('dev'): - default_repo_tag = 'r' + cver - log('default repo-tag: {}'.format(default_repo_tag)) + if not cver.endswith("dev"): + default_repo_tag = "r" + cver + log("default repo-tag: {}".format(default_repo_tag)) # For other versions use the TRITON_VERSION_MAP unless explicitly # given. 
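The default repo-tag rule applied above reduces to a one-line decision; the helper below is only an illustration of that rule, not a function in build.py.

```
def default_repo_tag_for(container_version):
    # Release containers track their matching r<version> branch;
    # development containers ("...dev") fall back to "main".
    return "main" if container_version.endswith("dev") else "r" + container_version

# e.g. default_repo_tag_for("23.06") -> "r23.06"
#      default_repo_tag_for("23.06dev") -> "main"
```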
FLAGS.container_version, FLAGS.upstream_container_version = container_versions( - FLAGS.version, FLAGS.container_version, - FLAGS.upstream_container_version) + FLAGS.version, FLAGS.container_version, FLAGS.upstream_container_version + ) - log('container version {}'.format(FLAGS.container_version)) - log('upstream container version {}'.format( - FLAGS.upstream_container_version)) + log("container version {}".format(FLAGS.container_version)) + log("upstream container version {}".format(FLAGS.upstream_container_version)) for ep in FLAGS.endpoint: log(f'endpoint "{ep}"') @@ -2136,116 +2478,146 @@ def enable_all(): # Initialize map of backends to build and repo-tag for each. backends = {} for be in FLAGS.backend: - parts = be.split(':') + parts = be.split(":") if len(parts) == 1: parts.append(default_repo_tag) + if parts[0] == "tensorflow1": + fail( + "Starting from Triton version 23.04, support for TensorFlow 1 has been discontinued. Please switch to Tensorflow 2." + ) + if parts[0] == "tensorflow2": + parts[0] = "tensorflow" log('backend "{}" at tag/branch "{}"'.format(parts[0], parts[1])) backends[parts[0]] = parts[1] + if "vllm" in backends: + if "python" not in backends: + log( + "vLLM backend requires Python backend, adding Python backend with tag {}".format( + backends["vllm"] + ) + ) + backends["python"] = backends["vllm"] + # Initialize map of repo agents to build and repo-tag for each. repoagents = {} for be in FLAGS.repoagent: - parts = be.split(':') + parts = be.split(":") if len(parts) == 1: parts.append(default_repo_tag) log('repoagent "{}" at tag/branch "{}"'.format(parts[0], parts[1])) repoagents[parts[0]] = parts[1] + # Initialize map of caches to build and repo-tag for each. + caches = {} + for be in FLAGS.cache: + parts = be.split(":") + if len(parts) == 1: + parts.append(default_repo_tag) + log('cache "{}" at tag/branch "{}"'.format(parts[0], parts[1])) + caches[parts[0]] = parts[1] + # Initialize map of docker images. images = {} for img in FLAGS.image: - parts = img.split(',') + parts = img.split(",") fail_if( - len(parts) != 2, - '--image must specify ,') + len(parts) != 2, "--image must specify ," + ) fail_if( - parts[0] not in [ - 'base', 'gpu-base', 'pytorch', 'tensorflow1', 'tensorflow2' - ], 'unsupported value for --image') + parts[0] + not in ["base", "gpu-base", "pytorch", "tensorflow", "tensorflow2"], + "unsupported value for --image", + ) log('image "{}": "{}"'.format(parts[0], parts[1])) + if parts[0] == "tensorflow2": + parts[0] = "tensorflow" images[parts[0]] = parts[1] # Initialize map of library paths for each backend. 
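A hedged sketch of the `<name>[:<tag>]` parsing used above for `--backend` (and, without the TensorFlow aliasing, for `--repoagent` and `--cache`); the standalone helper is illustrative and not part of build.py.

```
def parse_backend_spec(spec, default_tag):
    # Split "name[:tag]"; a missing tag falls back to the default repo tag.
    name, _, tag = spec.partition(":")
    if name == "tensorflow1":
        # Mirrors the check above: TensorFlow 1 support ended in 23.04.
        raise ValueError("TensorFlow 1 is no longer supported; use tensorflow")
    if name == "tensorflow2":
        # Legacy alias accepted for backward compatibility.
        name = "tensorflow"
    return name, (tag or default_tag)

# e.g. parse_backend_spec("onnxruntime", "r23.06") -> ("onnxruntime", "r23.06")
#      parse_backend_spec("tensorflow2:main", "r23.06") -> ("tensorflow", "main")
```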
library_paths = {} for lpath in FLAGS.library_paths: - parts = lpath.split(':') + parts = lpath.split(":") if len(parts) == 2: log('backend "{}" library path "{}"'.format(parts[0], parts[1])) + if parts[0] == "tensorflow2": + parts[0] = "tensorflow" library_paths[parts[0]] = parts[1] # Parse any explicitly specified cmake arguments for cf in FLAGS.extra_core_cmake_arg: - parts = cf.split('=') - fail_if( - len(parts) != 2, - '--extra-core-cmake-arg must specify =') + parts = cf.split("=") + fail_if(len(parts) != 2, "--extra-core-cmake-arg must specify =") log('CMake core extra "-D{}={}"'.format(parts[0], parts[1])) EXTRA_CORE_CMAKE_FLAGS[parts[0]] = parts[1] for cf in FLAGS.override_core_cmake_arg: - parts = cf.split('=') + parts = cf.split("=") fail_if( - len(parts) != 2, - '--override-core-cmake-arg must specify =') + len(parts) != 2, "--override-core-cmake-arg must specify =" + ) log('CMake core override "-D{}={}"'.format(parts[0], parts[1])) OVERRIDE_CORE_CMAKE_FLAGS[parts[0]] = parts[1] for cf in FLAGS.extra_backend_cmake_arg: - parts = cf.split(':', 1) + parts = cf.split(":", 1) fail_if( len(parts) != 2, - '--extra-backend-cmake-arg must specify :=') + "--extra-backend-cmake-arg must specify :=", + ) be = parts[0] - parts = parts[1].split('=', 1) + parts = parts[1].split("=", 1) fail_if( len(parts) != 2, - '--extra-backend-cmake-arg must specify :=') + "--extra-backend-cmake-arg must specify :=", + ) fail_if( be not in backends, - '--extra-backend-cmake-arg specifies backend "{}" which is not included in build' - .format(be)) + '--extra-backend-cmake-arg specifies backend "{}" which is not included in build'.format( + be + ), + ) log('backend "{}" CMake extra "-D{}={}"'.format(be, parts[0], parts[1])) if be not in EXTRA_BACKEND_CMAKE_FLAGS: EXTRA_BACKEND_CMAKE_FLAGS[be] = {} EXTRA_BACKEND_CMAKE_FLAGS[be][parts[0]] = parts[1] for cf in FLAGS.override_backend_cmake_arg: - parts = cf.split(':', 1) + parts = cf.split(":", 1) fail_if( len(parts) != 2, - '--override-backend-cmake-arg must specify :=' + "--override-backend-cmake-arg must specify :=", ) be = parts[0] - parts = parts[1].split('=', 1) + parts = parts[1].split("=", 1) fail_if( len(parts) != 2, - '--override-backend-cmake-arg must specify :=' + "--override-backend-cmake-arg must specify :=", ) fail_if( be not in backends, - '--override-backend-cmake-arg specifies backend "{}" which is not included in build' - .format(be)) - log('backend "{}" CMake override "-D{}={}"'.format( - be, parts[0], parts[1])) + '--override-backend-cmake-arg specifies backend "{}" which is not included in build'.format( + be + ), + ) + log('backend "{}" CMake override "-D{}={}"'.format(be, parts[0], parts[1])) if be not in OVERRIDE_BACKEND_CMAKE_FLAGS: OVERRIDE_BACKEND_CMAKE_FLAGS[be] = {} OVERRIDE_BACKEND_CMAKE_FLAGS[be][parts[0]] = parts[1] # Initialize map of common components and repo-tag for each. 
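As an illustrative aside on the `<backend>:<name>=<value>` format accepted above by `--extra-backend-cmake-arg` and `--override-backend-cmake-arg` (the helper and the example values are assumptions, not code from build.py):

```
def parse_backend_cmake_arg(arg):
    # Split off the backend name first, then the CMake name=value pair;
    # both separators are required, mirroring the fail_if checks above.
    backend, sep, rest = arg.partition(":")
    if not sep:
        raise ValueError("expected <backend>:<name>=<value>")
    name, sep, value = rest.partition("=")
    if not sep:
        raise ValueError("expected <backend>:<name>=<value>")
    return backend, name, value

# e.g. parse_backend_cmake_arg("python:CMAKE_BUILD_TYPE=Debug")
#      -> ("python", "CMAKE_BUILD_TYPE", "Debug")
```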
components = { - 'common': default_repo_tag, - 'core': default_repo_tag, - 'backend': default_repo_tag, - 'thirdparty': default_repo_tag + "common": default_repo_tag, + "core": default_repo_tag, + "backend": default_repo_tag, + "thirdparty": default_repo_tag, } for be in FLAGS.repo_tag: - parts = be.split(':') - fail_if( - len(parts) != 2, - '--repo-tag must specify :') + parts = be.split(":") + fail_if(len(parts) != 2, "--repo-tag must specify :") fail_if( parts[0] not in components, - '--repo-tag must be "common", "core", "backend", or "thirdparty"' + '--repo-tag must be "common", "core", "backend", or "thirdparty"', ) components[parts[0]] = parts[1] for c in components: @@ -2264,94 +2636,119 @@ def enable_all(): # FLAGS.tmp_dir may be specified with "\" on Windows, adjust # to "/" for docker usage. script_build_dir = os.path.normpath( - os.path.join(FLAGS.tmp_dir, 'tritonbuild').replace("\\", "/")) - script_install_dir = os.path.normpath( - os.path.join(script_build_dir, 'install')) - script_ci_dir = os.path.normpath(os.path.join(script_build_dir, 'ci')) - if target_platform() == 'windows': - script_repo_dir = script_cmake_dir = os.path.normpath( - 'c:/workspace') + os.path.join(FLAGS.tmp_dir, "tritonbuild").replace("\\", "/") + ) + script_install_dir = os.path.normpath(os.path.join(script_build_dir, "install")) + script_ci_dir = os.path.normpath(os.path.join(script_build_dir, "ci")) + if target_platform() == "windows": + script_repo_dir = script_cmake_dir = os.path.normpath("c:/workspace") else: - script_repo_dir = script_cmake_dir = '/workspace' + script_repo_dir = script_cmake_dir = "/workspace" - script_name = 'cmake_build' - if target_platform() == 'windows': - script_name += '.ps1' + script_name = "cmake_build" + if target_platform() == "windows": + script_name += ".ps1" - # Write the build script that invokes cmake for the core, backends, and repo-agents. + # Write the build script that invokes cmake for the core, backends, repo-agents, and caches. pathlib.Path(FLAGS.build_dir).mkdir(parents=True, exist_ok=True) with BuildScript( - os.path.join(FLAGS.build_dir, script_name), - verbose=FLAGS.verbose, - desc=('Build script for Triton Inference Server')) as cmake_script: - + os.path.join(FLAGS.build_dir, script_name), + verbose=FLAGS.verbose, + desc=("Build script for Triton Inference Server"), + ) as cmake_script: # Run the container pre-build command if the cmake build is # being done within the build container. if not FLAGS.no_container_build and FLAGS.container_prebuild_command: - cmake_script.cmd(FLAGS.container_prebuild_command, - check_exitcode=True) + cmake_script.cmd(FLAGS.container_prebuild_command, check_exitcode=True) cmake_script.blankln() # Commands to build the core shared library and the server executable. if not FLAGS.no_core_build: - core_build(cmake_script, script_repo_dir, script_cmake_dir, - script_build_dir, script_install_dir, components, - backends) + core_build( + cmake_script, + script_repo_dir, + script_cmake_dir, + script_build_dir, + script_install_dir, + components, + backends, + ) # Commands to build each backend... for be in backends: # Core backends are not built separately from core so skip... 
- if (be in CORE_BACKENDS): + if be in CORE_BACKENDS: continue - tagged_be_list = [] - if (be == 'openvino'): - tagged_be_list.append( - tagged_backend(be, TRITON_VERSION_MAP[FLAGS.version][4][0])) - if (FLAGS.build_multiple_openvino): - skip = True - for ver in TRITON_VERSION_MAP[FLAGS.version][4]: - if not skip: - tagged_be_list.append(tagged_backend(be, ver)) - skip = False - # If armnn_tflite backend, source from external repo for git clone - if be == 'armnn_tflite': - github_organization = 'https://gitlab.com/arm-research/smarter/' + if be == "armnn_tflite": + github_organization = "https://gitlab.com/arm-research/smarter/" else: github_organization = FLAGS.github_organization - if not tagged_be_list: - backend_build(be, cmake_script, backends[be], script_build_dir, - script_install_dir, github_organization, images, - components, library_paths) + if be == "vllm": + backend_clone( + be, + cmake_script, + backends[be], + script_build_dir, + script_install_dir, + github_organization, + ) else: - variant_index = 0 - for tagged_be in tagged_be_list: - backend_build(tagged_be, cmake_script, backends[be], - script_build_dir, script_install_dir, - github_organization, images, components, - library_paths, variant_index) - variant_index += 1 + backend_build( + be, + cmake_script, + backends[be], + script_build_dir, + script_install_dir, + github_organization, + images, + components, + library_paths, + ) # Commands to build each repo agent... for ra in repoagents: - repo_agent_build(ra, cmake_script, script_build_dir, - script_install_dir, repoagent_repo, repoagents) + repo_agent_build( + ra, + cmake_script, + script_build_dir, + script_install_dir, + repoagent_repo, + repoagents, + ) + + # Commands to build each cache... + for cache in caches: + cache_build( + cache, + cmake_script, + script_build_dir, + script_install_dir, + cache_repo, + caches, + ) # Commands needed only when building with Docker... if not FLAGS.no_container_build: # Commands to collect all the build artifacts needed for CI # testing. - cibase_build(cmake_script, script_repo_dir, script_cmake_dir, - script_build_dir, script_install_dir, script_ci_dir, - backends) + cibase_build( + cmake_script, + script_repo_dir, + script_cmake_dir, + script_build_dir, + script_install_dir, + script_ci_dir, + backends, + ) # When building with Docker the install and ci artifacts # written to the build-dir while running the docker container # may have root ownership, so give them permissions to be # managed by all users on the host system. - if target_platform() != 'windows': + if target_platform() != "windows": finalize_build(cmake_script, script_install_dir, script_ci_dir) # If --no-container-build is not specified then we perform the @@ -2360,24 +2757,25 @@ def enable_all(): # generate a few Dockerfiles and a top-level script that drives # the build process. if not FLAGS.no_container_build: - script_name = 'docker_build' - if target_platform() == 'windows': - script_name += '.ps1' + script_name = "docker_build" + if target_platform() == "windows": + script_name += ".ps1" - create_build_dockerfiles(script_build_dir, images, backends, repoagents, - FLAGS.endpoint) - create_docker_build_script(script_name, script_install_dir, - script_ci_dir) + create_build_dockerfiles( + script_build_dir, images, backends, repoagents, caches, FLAGS.endpoint + ) + create_docker_build_script(script_name, script_install_dir, script_ci_dir) # In not dry-run, execute the script to perform the build... 
If a # container-based build is requested use 'docker_build' script, # otherwise build directly on this system using cmake script. if not FLAGS.dryrun: - if target_platform() == 'windows': + if target_platform() == "windows": p = subprocess.Popen( - ['powershell.exe', '-noexit', '-File', f'./{script_name}'], - cwd=FLAGS.build_dir) + ["powershell.exe", "-noexit", "-File", f"./{script_name}"], + cwd=FLAGS.build_dir, + ) else: - p = subprocess.Popen([f'./{script_name}'], cwd=FLAGS.build_dir) + p = subprocess.Popen([f"./{script_name}"], cwd=FLAGS.build_dir) p.wait() - fail_if(p.returncode != 0, 'build failed') + fail_if(p.returncode != 0, "build failed") diff --git a/compose.py b/compose.py old mode 100644 new mode 100755 index 095cac9174..9f948c14fd --- a/compose.py +++ b/compose.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -39,7 +39,7 @@ def log(msg, force=False): try: print(msg, file=sys.stderr) except Exception: - print('', file=sys.stderr) + print("", file=sys.stderr) def log_verbose(msg): @@ -48,7 +48,7 @@ def log_verbose(msg): def fail(msg): - print('error: {}'.format(msg), file=sys.stderr) + print("error: {}".format(msg), file=sys.stderr) sys.exit(1) @@ -58,8 +58,8 @@ def fail_if(p, msg): def start_dockerfile(ddir, images, argmap, dockerfile_name, backends): - # Set enviroment variables, set default user and install dependencies - df = ''' + # Set environment variables, set default user and install dependencies + df = """ # # Multistage build. # @@ -67,30 +67,38 @@ def start_dockerfile(ddir, images, argmap, dockerfile_name, backends): ARG TRITON_CONTAINER_VERSION={} FROM {} AS full -'''.format(argmap['TRITON_VERSION'], argmap['TRITON_CONTAINER_VERSION'], - images["full"]) +""".format( + argmap["TRITON_VERSION"], argmap["TRITON_CONTAINER_VERSION"], images["full"] + ) # PyTorch, TensorFlow 1 and TensorFlow 2 backends need extra CUDA and other # dependencies during runtime that are missing in the CPU-only base container. # These dependencies must be copied from the Triton Min image. - if not FLAGS.enable_gpu and (('pytorch' in backends) or - ('tensorflow1' in backends) or - ('tensorflow2' in backends)): - df += ''' + if not FLAGS.enable_gpu and ( + ("pytorch" in backends) + or ("tensorflow1" in backends) + or ("tensorflow2" in backends) + ): + df += """ FROM {} AS min_container -'''.format(images["gpu-min"]) +""".format( + images["gpu-min"] + ) - df += ''' + df += """ FROM {} -'''.format(images["min"]) +""".format( + images["min"] + ) import build - df += build.dockerfile_prepare_container_linux(argmap, backends, - FLAGS.enable_gpu, - platform.machine().lower()) + + df += build.dockerfile_prepare_container_linux( + argmap, backends, FLAGS.enable_gpu, platform.machine().lower() + ) # Copy over files - df += ''' + df += """ WORKDIR /opt/tritonserver COPY --chown=1000:1000 --from=full /opt/tritonserver/LICENSE . COPY --chown=1000:1000 --from=full /opt/tritonserver/TRITON_VERSION . 
@@ -98,7 +106,7 @@ def start_dockerfile(ddir, images, argmap, dockerfile_name, backends): COPY --chown=1000:1000 --from=full /opt/tritonserver/bin bin/ COPY --chown=1000:1000 --from=full /opt/tritonserver/lib lib/ COPY --chown=1000:1000 --from=full /opt/tritonserver/include include/ -''' +""" with open(os.path.join(ddir, dockerfile_name), "w") as dfile: dfile.write(df) @@ -106,17 +114,15 @@ def start_dockerfile(ddir, images, argmap, dockerfile_name, backends): def add_requested_backends(ddir, dockerfile_name, backends): df = "# Copying over backends \n" for backend in backends: - if backend == 'openvino': - import build - ver = next(iter(build.TRITON_VERSION_MAP.values())) - backend = build.tagged_backend(backend, ver[4][0]) - df += '''COPY --chown=1000:1000 --from=full /opt/tritonserver/backends/{} /opt/tritonserver/backends/{} -'''.format(backend, backend) + df += """COPY --chown=1000:1000 --from=full /opt/tritonserver/backends/{} /opt/tritonserver/backends/{} +""".format( + backend, backend + ) if len(backends) > 0: - df += ''' + df += """ # Top-level /opt/tritonserver/backends not copied so need to explicitly set permissions here RUN chown triton-server:triton-server /opt/tritonserver/backends -''' +""" with open(os.path.join(ddir, dockerfile_name), "a") as dfile: dfile.write(df) @@ -124,13 +130,31 @@ def add_requested_backends(ddir, dockerfile_name, backends): def add_requested_repoagents(ddir, dockerfile_name, repoagents): df = "# Copying over repoagents \n" for ra in repoagents: - df += '''COPY --chown=1000:1000 --from=full /opt/tritonserver/repoagents/{} /opt/tritonserver/repoagents/{} -'''.format(ra, ra) + df += """COPY --chown=1000:1000 --from=full /opt/tritonserver/repoagents/{} /opt/tritonserver/repoagents/{} +""".format( + ra, ra + ) if len(repoagents) > 0: - df += ''' + df += """ # Top-level /opt/tritonserver/repoagents not copied so need to explicitly set permissions here RUN chown triton-server:triton-server /opt/tritonserver/repoagents -''' +""" + with open(os.path.join(ddir, dockerfile_name), "a") as dfile: + dfile.write(df) + + +def add_requested_caches(ddir, dockerfile_name, caches): + df = "# Copying over caches \n" + for cache in caches: + df += """COPY --chown=1000:1000 --from=full /opt/tritonserver/caches/{} /opt/tritonserver/caches/{} +""".format( + cache, cache + ) + if len(caches) > 0: + df += """ +# Top-level /opt/tritonserver/caches not copied so need to explicitly set permissions here +RUN chown triton-server:triton-server /opt/tritonserver/caches +""" with open(os.path.join(ddir, dockerfile_name), "a") as dfile: dfile.write(df) @@ -138,226 +162,292 @@ def add_requested_repoagents(ddir, dockerfile_name, repoagents): def end_dockerfile(ddir, dockerfile_name, argmap): # Install additional dependencies df = "" - if argmap['SAGEMAKER_ENDPOINT']: - df += ''' + if argmap["SAGEMAKER_ENDPOINT"]: + df += """ LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true COPY --chown=1000:1000 --from=full /usr/bin/serve /usr/bin/. 
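To make the new cache handling concrete, here is a self-contained sketch of the Dockerfile fragment that `add_requested_caches()` above appends; this minimal re-implementation is for demonstration only and returns the fragment instead of writing to Dockerfile.compose.

```
def caches_fragment(caches):
    # Copy each cache implementation out of the "full" image, then fix
    # ownership of the top-level caches directory.
    df = "# Copying over caches \n"
    for cache in caches:
        df += (
            "COPY --chown=1000:1000 --from=full "
            f"/opt/tritonserver/caches/{cache} /opt/tritonserver/caches/{cache}\n"
        )
    if caches:
        df += (
            "\n# Top-level /opt/tritonserver/caches not copied so need to "
            "explicitly set permissions here\n"
            "RUN chown triton-server:triton-server /opt/tritonserver/caches\n"
        )
    return df

print(caches_fragment(["local", "redis"]))
```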
-''' +""" with open(os.path.join(ddir, dockerfile_name), "a") as dfile: dfile.write(df) def build_docker_image(ddir, dockerfile_name, container_name): # Create container with docker build - p = subprocess.Popen(['docker', 'build', '-t', container_name, '-f', \ - os.path.join(ddir, dockerfile_name), '.']) + p = subprocess.Popen( + [ + "docker", + "build", + "-t", + container_name, + "-f", + os.path.join(ddir, dockerfile_name), + ".", + ] + ) p.wait() - fail_if(p.returncode != 0, 'docker build {} failed'.format(container_name)) + fail_if(p.returncode != 0, "docker build {} failed".format(container_name)) def get_container_version_if_not_specified(): if FLAGS.container_version is None: # Read from TRITON_VERSION file in server repo to determine version - with open('TRITON_VERSION', "r") as vfile: + with open("TRITON_VERSION", "r") as vfile: version = vfile.readline().strip() import build + _, FLAGS.container_version = build.container_versions( - version, None, FLAGS.container_version) - log('version {}'.format(version)) - log('using container version {}'.format(FLAGS.container_version)) + version, None, FLAGS.container_version + ) + log("version {}".format(version)) + log("using container version {}".format(FLAGS.container_version)) -def create_argmap(images): +def create_argmap(images, skip_pull): # Extract information from upstream build and create map other functions can # use full_docker_image = images["full"] min_docker_image = images["min"] enable_gpu = FLAGS.enable_gpu - # Docker inspect enviroment variables - base_run_args = ['docker', 'inspect', '-f'] - import re # parse all PATH enviroment variables + # Docker inspect environment variables + base_run_args = ["docker", "inspect", "-f"] + import re # parse all PATH environment variables # first pull docker images - log("pulling container:{}".format(full_docker_image)) - p = subprocess.run(['docker', 'pull', full_docker_image]) - fail_if( - p.returncode != 0, - 'docker pull container {} failed, {}'.format(full_docker_image, - p.stderr)) - if enable_gpu: - pm = subprocess.run(['docker', 'pull', min_docker_image]) + if not skip_pull: + log("pulling container:{}".format(full_docker_image)) + p = subprocess.run(["docker", "pull", full_docker_image]) fail_if( - pm.returncode != 0, 'docker pull container {} failed, {}'.format( - min_docker_image, pm.stderr)) - pm_path = subprocess.run(base_run_args + [ - '{{range $index, $value := .Config.Env}}{{$value}} {{end}}', - min_docker_image - ], - capture_output=True, - text=True) + p.returncode != 0, + "docker pull container {} failed, {}".format(full_docker_image, p.stderr), + ) + if enable_gpu: + if not skip_pull: + pm = subprocess.run(["docker", "pull", min_docker_image]) + fail_if( + pm.returncode != 0 and not skip_pull, + "docker pull container {} failed, {}".format( + min_docker_image, pm.stderr + ), + ) + pm_path = subprocess.run( + base_run_args + + [ + "{{range $index, $value := .Config.Env}}{{$value}} {{end}}", + min_docker_image, + ], + capture_output=True, + text=True, + ) fail_if( pm_path.returncode != 0, - 'docker inspect to find triton enviroment variables for min container failed, {}' - .format(pm_path.stderr)) - # min container needs to be GPU support enabled if the build is GPU build + "docker inspect to find triton environment variables for min container failed, {}".format( + pm_path.stderr + ), + ) + # min container needs to be GPU-support-enabled if the build is GPU build vars = pm_path.stdout e = re.search("CUDA_VERSION", vars) gpu_enabled = False if e is None else True 
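A hedged, standalone sketch of the `docker inspect` environment probing performed in `create_argmap()` above; the helper name and the sample values are illustrative only.

```
import re

def env_value(inspect_output, key):
    # inspect_output is the space-separated "KEY=VALUE " string produced by
    # docker inspect -f '{{range $index, $value := .Config.Env}}{{$value}} {{end}}'
    match = re.search(rf"{key}=([\S]+) ", inspect_output)
    return None if match is None else match.group(1)

# e.g. env_value("TRITON_SERVER_GPU_ENABLED=1 TRITON_SERVER_VERSION=2.35.0 ",
#                "TRITON_SERVER_VERSION") -> "2.35.0"
```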
fail_if( not gpu_enabled, - 'Composing container with gpu support enabled but min container provided does not have CUDA installed' + "Composing container with gpu support enabled but min container provided does not have CUDA installed", ) - # Check full container enviroment variables - p_path = subprocess.run(base_run_args + [ - '{{range $index, $value := .Config.Env}}{{$value}} {{end}}', - full_docker_image - ], - capture_output=True, - text=True) + # Check full container environment variables + p_path = subprocess.run( + base_run_args + + [ + "{{range $index, $value := .Config.Env}}{{$value}} {{end}}", + full_docker_image, + ], + capture_output=True, + text=True, + ) fail_if( p_path.returncode != 0, - 'docker inspect to find enviroment variables for full container failed, {}' - .format(p_path.stderr)) + "docker inspect to find environment variables for full container failed, {}".format( + p_path.stderr + ), + ) vars = p_path.stdout log_verbose("inspect args: {}".format(vars)) e0 = re.search("TRITON_SERVER_GPU_ENABLED=([\S]{1,}) ", vars) e1 = re.search("CUDA_VERSION", vars) gpu_enabled = False - if (e0 != None): + if e0 != None: gpu_enabled = e0.group(1) == "1" - elif (e1 != None): + elif e1 != None: gpu_enabled = True fail_if( gpu_enabled != enable_gpu, - 'Error: full container provided was build with \'TRITON_SERVER_GPU_ENABLED\' as {} and you are composing container with \'TRITON_SERVER_GPU_ENABLED\' as {}' - .format(gpu_enabled, enable_gpu)) + "Error: full container provided was build with " + "'TRITON_SERVER_GPU_ENABLED' as {} and you are composing container" + "with 'TRITON_SERVER_GPU_ENABLED' as {}".format(gpu_enabled, enable_gpu), + ) e = re.search("TRITON_SERVER_VERSION=([\S]{6,}) ", vars) version = "" if e is None else e.group(1) fail_if( len(version) == 0, - 'docker inspect to find triton server version failed, {}'.format( - p_path.stderr)) + "docker inspect to find triton server version failed, {}".format(p_path.stderr), + ) e = re.search("NVIDIA_TRITON_SERVER_VERSION=([\S]{5,}) ", vars) container_version = "" if e is None else e.group(1) fail_if( len(container_version) == 0, - 'docker inspect to find triton container version failed, {}'.format( - vars)) + "docker inspect to find triton container version failed, {}".format(vars), + ) dcgm_ver = re.search("DCGM_VERSION=([\S]{4,}) ", vars) dcgm_version = "" if dcgm_ver is None: dcgm_version = "2.2.3" - log("WARNING: DCGM version not found from image, installing the earlierst version {}" - .format(dcgm_version)) + log( + "WARNING: DCGM version not found from image, installing the earlierst version {}".format( + dcgm_version + ) + ) else: dcgm_version = dcgm_ver.group(1) fail_if( len(dcgm_version) == 0, - 'docker inspect to find DCGM version failed, {}'.format(vars)) + "docker inspect to find DCGM version failed, {}".format(vars), + ) p_sha = subprocess.run( - base_run_args + - ['{{ index .Config.Labels "com.nvidia.build.ref"}}', full_docker_image], + base_run_args + + ['{{ index .Config.Labels "com.nvidia.build.ref"}}', full_docker_image], capture_output=True, - text=True) + text=True, + ) fail_if( p_sha.returncode != 0, - 'docker inspect of upstream docker image build sha failed, {}'.format( - p_sha.stderr)) + "docker inspect of upstream docker image build sha failed, {}".format( + p_sha.stderr + ), + ) p_build = subprocess.run( - base_run_args + - ['{{ index .Config.Labels "com.nvidia.build.id"}}', full_docker_image], + base_run_args + + ['{{ index .Config.Labels "com.nvidia.build.id"}}', full_docker_image], 
capture_output=True, - text=True) + text=True, + ) fail_if( p_build.returncode != 0, - 'docker inspect of upstream docker image build sha failed, {}'.format( - p_build.stderr)) + "docker inspect of upstream docker image build sha failed, {}".format( + p_build.stderr + ), + ) p_find = subprocess.run( - ['docker', 'run', full_docker_image, 'bash', '-c', 'ls /usr/bin/'], + ["docker", "run", full_docker_image, "bash", "-c", "ls /usr/bin/"], capture_output=True, - text=True) + text=True, + ) f = re.search("serve", p_find.stdout) - fail_if(p_find.returncode != 0, - "Cannot search for 'serve' in /usr/bin, {}".format(p_find.stderr)) + fail_if( + p_find.returncode != 0, + "Cannot search for 'serve' in /usr/bin, {}".format(p_find.stderr), + ) argmap = { - 'NVIDIA_BUILD_REF': p_sha.stdout.rstrip(), - 'NVIDIA_BUILD_ID': p_build.stdout.rstrip(), - 'TRITON_VERSION': version, - 'TRITON_CONTAINER_VERSION': container_version, - 'DCGM_VERSION': dcgm_version, - 'SAGEMAKER_ENDPOINT': f is not None, + "NVIDIA_BUILD_REF": p_sha.stdout.rstrip(), + "NVIDIA_BUILD_ID": p_build.stdout.rstrip(), + "TRITON_VERSION": version, + "TRITON_CONTAINER_VERSION": container_version, + "DCGM_VERSION": dcgm_version, + "SAGEMAKER_ENDPOINT": f is not None, } return argmap -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() group_qv = parser.add_mutually_exclusive_group() - group_qv.add_argument('-q', - '--quiet', - action="store_true", - required=False, - help='Disable console output.') - group_qv.add_argument('-v', - '--verbose', - action="store_true", - required=False, - help='Enable verbose output.') + group_qv.add_argument( + "-q", + "--quiet", + action="store_true", + required=False, + help="Disable console output.", + ) + group_qv.add_argument( + "-v", + "--verbose", + action="store_true", + required=False, + help="Enable verbose output.", + ) parser.add_argument( - '--output-name', + "--output-name", type=str, required=False, - help='Name for the generated Docker image. Default is "tritonserver".') + help='Name for the generated Docker image. Default is "tritonserver".', + ) parser.add_argument( - '--work-dir', + "--work-dir", type=str, required=False, - help= - 'Generated dockerfiles are placed here. Default to current directory.') + help="Generated dockerfiles are placed here. Default to current directory.", + ) parser.add_argument( - '--container-version', + "--container-version", type=str, required=False, - help= - 'The version to use for the generated Docker image. If not specified the container version will be chosen automatically based on the repository branch.' + help="The version to use for the generated Docker image. If not specified " + "the container version will be chosen automatically based on the " + "repository branch.", + ) + parser.add_argument( + "--image", + action="append", + required=False, + help="Use specified Docker image to generate Docker image. Specified as " + ',. can be "min", "gpu-min" ' + 'or "full". Both "min" and "full" need to be specified at the same time.' + 'This will override "--container-version". "gpu-min" is needed for ' + "CPU-only container to copy TensorFlow and PyTorch deps.", + ) + parser.add_argument( + "--enable-gpu", + nargs="?", + type=lambda x: (str(x).lower() == "true"), + const=True, + default=True, + required=False, + help=argparse.SUPPRESS, ) parser.add_argument( - '--image', - action='append', + "--backend", + action="append", required=False, - help= - 'Use specified Docker image to generate Docker image. Specified as ,. 
can be "min", "gpu-min" or "full". Both "min" and "full" need to be specified at the same time. This will override "--container-version". "gpu-min" is needed for CPU-only container to copy TensorFlow and PyTorch deps.' + help="Include in the generated Docker image. The flag may be " + "specified multiple times.", ) - parser.add_argument('--enable-gpu', - nargs='?', - type=lambda x: (str(x).lower() == 'true'), - const=True, - default=True, - required=False, - help=argparse.SUPPRESS) parser.add_argument( - '--backend', - action='append', + "--repoagent", + action="append", required=False, - help= - 'Include in the generated Docker image. The flag may be specified multiple times.' + help="Include in the generated Docker image. The flag may " + "be specified multiple times.", ) parser.add_argument( - '--repoagent', - action='append', + "--cache", + action="append", required=False, - help= - 'Include in the generated Docker image. The flag may be specified multiple times.' + help="Include in the generated Docker image. The flag may " + "be specified multiple times.", ) parser.add_argument( - '--dry-run', + "--skip-pull", action="store_true", required=False, - help='Only creates Dockerfile.compose, does not build the Docker image.' + help="Do not pull the required docker images. The user is responsible " + "for pulling the upstream images needed to compose the image.", + ) + parser.add_argument( + "--dry-run", + action="store_true", + required=False, + help="Only creates Dockerfile.compose, does not build the Docker image.", ) FLAGS = parser.parse_args() @@ -367,64 +457,69 @@ def create_argmap(images): if FLAGS.output_name is None: FLAGS.output_name = "tritonserver" - dockerfile_name = 'Dockerfile.compose' + dockerfile_name = "Dockerfile.compose" if FLAGS.backend is None: FLAGS.backend = [] if FLAGS.repoagent is None: FLAGS.repoagent = [] + if FLAGS.cache is None: + FLAGS.cache = [] # Initialize map of docker images. images = {} if FLAGS.image: for img in FLAGS.image: - parts = img.split(',') + parts = img.split(",") fail_if( len(parts) != 2, - '--image must specific ,') + "--image must specific ,", + ) fail_if( - parts[0] not in ['min', 'full', 'gpu-min'], - 'unsupported image-name \'{}\' for --image'.format(parts[0])) + parts[0] not in ["min", "full", "gpu-min"], + "unsupported image-name '{}' for --image".format(parts[0]), + ) log('image "{}": "{}"'.format(parts[0], parts[1])) images[parts[0]] = parts[1] else: get_container_version_if_not_specified() if FLAGS.enable_gpu: images = { - "full": - "nvcr.io/nvidia/tritonserver:{}-py3".format( - FLAGS.container_version), - "min": - "nvcr.io/nvidia/tritonserver:{}-py3-min".format( - FLAGS.container_version) + "full": "nvcr.io/nvidia/tritonserver:{}-py3".format( + FLAGS.container_version + ), + "min": "nvcr.io/nvidia/tritonserver:{}-py3-min".format( + FLAGS.container_version + ), } else: images = { - "full": - "nvcr.io/nvidia/tritonserver:{}-cpu-only-py3".format( - FLAGS.container_version), - "min": - "ubuntu:20.04" + "full": "nvcr.io/nvidia/tritonserver:{}-cpu-only-py3".format( + FLAGS.container_version + ), + "min": "ubuntu:22.04", } - fail_if( - len(images) < 2, - "Need to specify both 'full' and 'min' images if at all") + fail_if(len(images) < 2, "Need to specify both 'full' and 'min' images if at all") # For CPU-only image we need to copy some cuda libraries and dependencies # since we are using PyTorch, TensorFlow 1, TensorFlow 2 containers that # are not CPU-only. 
- if (('pytorch' in FLAGS.backend) or ('tensorflow1' in FLAGS.backend) or - ('tensorflow2' in FLAGS.backend)) and ('gpu-min' not in images): + if ( + ("pytorch" in FLAGS.backend) + or ("tensorflow1" in FLAGS.backend) + or ("tensorflow2" in FLAGS.backend) + ) and ("gpu-min" not in images): images["gpu-min"] = "nvcr.io/nvidia/tritonserver:{}-py3-min".format( - FLAGS.container_version) + FLAGS.container_version + ) - argmap = create_argmap(images) + argmap = create_argmap(images, FLAGS.skip_pull) - start_dockerfile(FLAGS.work_dir, images, argmap, dockerfile_name, - FLAGS.backend) + start_dockerfile(FLAGS.work_dir, images, argmap, dockerfile_name, FLAGS.backend) add_requested_backends(FLAGS.work_dir, dockerfile_name, FLAGS.backend) add_requested_repoagents(FLAGS.work_dir, dockerfile_name, FLAGS.repoagent) + add_requested_caches(FLAGS.work_dir, dockerfile_name, FLAGS.cache) end_dockerfile(FLAGS.work_dir, dockerfile_name, argmap) - if (not FLAGS.dry_run): + if not FLAGS.dry_run: build_docker_image(FLAGS.work_dir, dockerfile_name, FLAGS.output_name) diff --git a/deploy/alibaba-cloud/README.md b/deploy/alibaba-cloud/README.md index 1dea4ede11..98f914a693 100644 --- a/deploy/alibaba-cloud/README.md +++ b/deploy/alibaba-cloud/README.md @@ -1,5 +1,5 @@ -# Deploy Triton Inference Server on PAI-EAS +# Deploy Triton Inference Server on PAI-EAS * Table Of Contents - [Description](https://yuque.alibaba-inc.com/pai/blade/mtptqc#Description) - [Prerequisites](https://yuque.alibaba-inc.com/pai/blade/mtptqc#Prerequisites) @@ -57,11 +57,11 @@ Download the tensorflow inception model via [fetch_model.sh](https://github.com/ The following is the json we use when creating a Triton Server on EAS. ``` { - "name": "", + "name": "", "processor": "triton", "processor_params": [ - "--model-repository=oss://triton-model-repo/models", - "--allow-grpc=true", + "--model-repository=oss://triton-model-repo/models", + "--allow-grpc=true", "--allow-http=true" ], "metadata": { diff --git a/deploy/aws/README.md b/deploy/aws/README.md index 8e99d45c63..4e60fdd65b 100644 --- a/deploy/aws/README.md +++ b/deploy/aws/README.md @@ -1,5 +1,5 @@ + +# Instruction to create BERT engine for each Triton update + +## Description + +``` +docker run --gpus all -it --network host \ + --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \ + -v ~:/scripts nvcr.io/nvidia/tensorrt:23.11-py3 + +pip install onnx six torch tf2onnx tensorflow + +git clone -b main https://github.com/NVIDIA/TensorRT.git +cd TensorRT +git submodule update --init --recursive + +export TRT_OSSPATH=/workspace/TensorRT +export TRT_LIBPATH=/lib/x86_64-linux-gnu + +pushd /usr/local/bin && wget https://ngc.nvidia.com/downloads/ngccli_cat_linux.zip && unzip ngccli_cat_linux.zip && chmod u+x ngc-cli/ngc && rm ngccli_cat_linux.zip ngc-cli.md5 && ln -s ngc-cli/ngc ngc && echo "no-apikey\nascii\n" | ngc config set + +popd + +cd /workspace/TensorRT/demo/BERT +bash ./scripts/download_squad.sh +bash ./scripts/download_model.sh large 128 +# bash ./scripts/download_model.sh large 384 + +mkdir -p engines + +python3 builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/model.ckpt -o engines/bert_large_int8_bs1_s128.engine -b 1 -s 128 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/ -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/vocab.txt --int8 --fp16 --strict --calib-num 1 -iln -imh + +gsutil cp bert_large_int8_bs1_s128.engine gs://triton_sample_models/23_09/bert/1/model.plan +``` + +For each Triton upgrade, container 
version used to generate the model, and the model path in GCS `gs://triton_sample_models/23_09/` should be updated accordingly with the correct version. diff --git a/deploy/k8s-onprem/README.md b/deploy/k8s-onprem/README.md index 48f6a9b911..4287b23c35 100644 --- a/deploy/k8s-onprem/README.md +++ b/deploy/k8s-onprem/README.md @@ -1,5 +1,5 @@ -# Triton Inference Server Documentation - -## User Guide -The User Guide describes how to use Triton as an inference solution, including information on how to configure Triton, how to organize and configure your models, how to use the C++ and Python clients, etc. - -- [QuickStart](quickstart.md) - - [Install Triton](quickstart.md#install-triton-docker-image) - - [Create Model Repository](quickstart.md#create-a-model-repository) - - [Run Triton](quickstart.md#run-triton) -- [Model Repository](model_repository.md) - - [Cloud Storage](model_repository.md#model-repository-locations) - - [File Organization](model_repository.md#model-files) - - [Model Versioning](model_repository.md#model-versions) -- [Model Configuration](model_configuration.md) - - [Required Model Configuration](model_configuration.md#minimal-model-configuration) - - [Maximum Batch Size - Batching and Non-Batching Models](model_configuration.md#maximum-batch-size) - - [Input and Output Tensors](model_configuration.md#inputs-and-outputs) - - [Tensor Datatypes](model_configuration.md#datatypes) - - [Tensor Reshape](model_configuration.md#reshape) - - [Shape Tensor](model_configuration.md#shape-tensors) - - [Auto-Generate Required Model Configuration](model_configuration.md#auto-generated-model-configuration) - - [Version Policy](model_configuration.md#version-policy) - - [Instance Groups](model_configuration.md#instance-groups) - - [Specifying Multiple Model Instances](model_configuration.md#multiple-model-instances) - - [CPU and GPU Instances](model_configuration.md#cpu-model-instance) - - [Configuring Rate Limiter](model_configuration.md#rate-limiter-configuration) - - [Optimization Settings](model_configuration.md#optimization_policy) - - [Framework-Specific Optimization](optimization.md#framework-specific-optimization) - - [ONNX-TensorRT](optimization.md#onnx-with-tensorrt-optimization-ort-trt) - - [ONNX-OpenVINO](optimization.md#onnx-with-openvino-optimization) - - [TensorFlow-TensorRT](optimization.md#tensorflow-with-tensorrt-optimization-tf-trt) - - [TensorFlow-Mixed-Precision](optimization.md#tensorflow-automatic-fp16-optimization) - - [NUMA Optimization](optimization.md#numa-optimization) - - [Scheduling and Batching](model_configuration.md#scheduling-and-batching) - - [Default Scheduler - Non-Batching](model_configuration.md#default-scheduler) - - [Dynamic Batcher](model_configuration.md#dynamic-batcher) - - [How to Configure Dynamic Batcher](model_configuration.md#recommended-configuration-process) - - [Delayed Batching](model_configuration.md#delayed-batching) - - [Preferred Batch Size](model_configuration.md#preferred-batch-sizes) - - [Preserving Request Ordering](model_configuration.md#preserve-ordering) - - [Priority Levels](model_configuration.md#priority-levels) - - [Queuing Policies](model_configuration.md#queue-policy) - - [Ragged Batching](ragged_batching.md) - - [Sequence Batcher](model_configuration.md#sequence-batcher) - - [Stateful Models](architecture.md#stateful-models) - - [Control Inputs](architecture.md#control-inputs) - - [Implicit State - Stateful Inference Using a Stateless Model](architecture.md#implicit-state-management) - - [Sequence Scheduling 
Strategies](architecture.md#scheduling-strateties) - - [Direct](architecture.md#direct) - - [Oldest](architecture.md#oldest) - - [Rate Limiter](rate_limiter.md) - - [Model Warmup](model_configuration.md#model-warmup) - - [Inference Request/Response Cache](model_configuration.md#response-cache) -- Model Pipeline - - [Model Ensemble](architecture.md#ensemble-models) - - [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting) -- [Model Management](model_management.md) - - [Explicit Model Loading and Unloading](model_management.md#model-control-mode-explicit) - - [Modifying the Model Repository](model_management.md#modifying-the-model-repository) -- [Metrics](metrics.md) -- [Framework Custom Operations](custom_operations.md) - - [TensorRT](custom_operations.md#tensorrt) - - [TensorFlow](custom_operations.md#tensorflow) - - [PyTorch](custom_operations.md#pytorch) - - [ONNX](custom_operations.md#onnx) -- [Client Libraries and Examples](https://github.com/triton-inference-server/client) - - [C++ HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) - - [Python HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) - - [Java HTTP Library](https://github.com/triton-inference-server/client/tree/main/src/java) - - GRPC Generated Libraries - - [go](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/go) - - [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java) - - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) -- [Performance Analysis](optimization.md) - - [Model Analyzer](model_analyzer.md) - - [Performance Analyzer](perf_analyzer.md) - - [Inference Request Tracing](trace.md) -- [Jetson and JetPack](jetson.md) - -## Developer Guide -The Developer Guide describes how to build and test Triton and also how Triton can be extended with new functionality. - -- [Build](build.md) -- [Protocols and APIs](inference_protocols.md). +# **Triton Inference Server Documentation** + +| [Installation](README.md#installation) | [Getting Started](README.md#getting-started) | [User Guide](README.md#user-guide) | [API Guide](protocol/README.md) | [Additional Resources](README.md#resources) | [Customization Guide](README.md#customization-guide) | +| ------------ | --------------- | --------------- | ------------ | --------------- | --------------- | + +**New to Triton Inference Server?** Make use of +[these tutorials](https://github.com/triton-inference-server/tutorials) + to begin your Triton journey! + +## **Installation** +Before you can use the Triton Docker image you must install +[Docker](https://docs.docker.com/engine/install). If you plan on using +a GPU for inference you must also install the [NVIDIA Container +Toolkit](https://github.com/NVIDIA/nvidia-docker). DGX users should +follow [Preparing to use NVIDIA +Containers](http://docs.nvidia.com/deeplearning/dgx/preparing-containers/index.html). + +Pull the image using the following command. + +``` +$ docker pull nvcr.io/nvidia/tritonserver:-py3 +``` + +Where \ is the version of Triton that you want to pull. For a complete list of all the variants and versions of the Triton Inference Server Container, visit the [NGC Page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver). 
More information about customizing the Triton Container can be found in [this section](customization_guide/compose.md) of the User Guide. + +## **Getting Started** + +This guide covers the simplest possible workflow for deploying a model using Triton Inference Server. +- [Create a Model Repository](getting_started/quickstart.md#create-a-model-repository) +- [Launch Triton](getting_started/quickstart.md#launch-triton) +- [Send an Inference Request](getting_started/quickstart.md#send-an-inference-request) + +Triton Inference Server has a considerable list of versatile and powerful features. All new users are encouraged to explore the [User Guide](README.md#user-guide) and the [additional resources](README.md#resources) sections for features most relevant to their use case. + +## **User Guide** +The User Guide describes how to configure Triton, organize and configure your models, use the C++ and Python clients, etc. This guide includes the following: +* Creating a Model Repository [[Overview](README.md#model-repository) || [Details](user_guide/model_repository.md)] +* Writing a Model Configuration [[Overview](README.md#model-configuration) || [Details](user_guide/model_configuration.md)] +* Building a Model Pipeline [[Overview](README.md#model-pipeline)] +* Managing Model Availability [[Overview](README.md#model-management) || [Details](user_guide/model_management.md)] +* Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)] +* Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)] +* Using the Client API [[Overview](README.md#client-libraries-and-examples) || [Details](https://github.com/triton-inference-server/client)] +* Cancelling Inference Requests [[Overview](README.md#cancelling-inference-requests) || [Details](user_guide/request_cancellation.md)] +* Analyzing Performance [[Overview](README.md#performance-analysis)] +* Deploying on edge (Jetson) [[Overview](README.md#jetson-and-jetpack)] +* Debugging Guide [Details](./user_guide/debugging_guide.md) + +### Model Repository +[Model Repositories](user_guide/model_repository.md) are the organizational hub for using Triton. All models, configuration files, and additional resources needed to serve the models are housed inside a model repository. +- [Cloud Storage](user_guide/model_repository.md#model-repository-locations) +- [File Organization](user_guide/model_repository.md#model-files) +- [Model Versioning](user_guide/model_repository.md#model-versions) +### Model Configuration + +A [Model Configuration](user_guide/model_configuration.md) file is where you set the model-level options, such as output tensor reshaping and dynamic batch sizing. + +#### Required Model Configuration + +Triton Inference Server requires some [Minimum Required parameters](user_guide/model_configuration.md#minimal-model-configuration) to be filled in the Model Configuration. These required parameters essentially pertain to the structure of the model. For TensorFlow, ONNX and TensorRT models, users can rely on Triton to [Auto Generate](user_guide/model_configuration.md#auto-generated-model-configuration) the Minimum Required model configuration.
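+A quick way to see what Triton filled in is to ask a running server for the
+configuration it generated through the model configuration endpoint. The
+following is a minimal sketch, assuming the server is listening on the default
+HTTP port 8000 and serving a hypothetical model named `densenet_onnx`:
+
+```
+$ curl localhost:8000/v2/models/densenet_onnx/config
+```
+
+The returned JSON lists the inputs, outputs and other settings Triton derived
+from the model, which can be used as a starting point for an explicit
+`config.pbtxt` if one is needed.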
+- [Maximum Batch Size - Batching and Non-Batching Models](user_guide/model_configuration.md#maximum-batch-size) +- [Input and Output Tensors](user_guide/model_configuration.md#inputs-and-outputs) + - [Tensor Datatypes](user_guide/model_configuration.md#datatypes) + - [Tensor Reshape](user_guide/model_configuration.md#reshape) + - [Shape Tensor](user_guide/model_configuration.md#shape-tensors) + +#### Versioning Models +Users need the ability to save and serve different versions of models based on business requirements. Triton allows users to set policies to make available different versions of the model as needed. [Learn More](user_guide/model_configuration.md#version-policy). + +#### Instance Groups +Triton allows users to use multiple instances of the same model. Users can specify how many instances (copies) of a model to load and whether to use GPU or CPU. If the model is being loaded on GPU, users can also select which GPUs to use. [Learn more](user_guide/model_configuration.md#instance-groups). +- [Specifying Multiple Model Instances](user_guide/model_configuration.md#multiple-model-instances) +- [CPU and GPU Instances](user_guide/model_configuration.md#cpu-model-instance) +- [Configuring Rate Limiter](user_guide/model_configuration.md#rate-limiter-configuration) + +#### Optimization Settings + +The Model Configuration ModelOptimizationPolicy property is used to specify optimization and prioritization settings for a model. These settings control if/how a model is optimized by the backend and how it is scheduled and executed by Triton. See the [ModelConfig Protobuf](https://github.com/triton-inference-server/common/blob/main/protobuf/model_config.proto) and [Optimization Documentation](user_guide/optimization.md#optimization) for the currently available settings. +- [Framework-Specific Optimization](user_guide/optimization.md#framework-specific-optimization) + - [ONNX-TensorRT](user_guide/optimization.md#onnx-with-tensorrt-optimization-ort-trt) + - [ONNX-OpenVINO](user_guide/optimization.md#onnx-with-openvino-optimization) + - [TensorFlow-TensorRT](user_guide/optimization.md#tensorflow-with-tensorrt-optimization-tf-trt) + - [TensorFlow-Mixed-Precision](user_guide/optimization.md#tensorflow-automatic-fp16-optimization) +- [NUMA Optimization](user_guide/optimization.md#numa-optimization) + +#### Scheduling and Batching + +Triton supports batching individual inference requests to improve compute resource utilization. This is extremely important because individual requests typically will not saturate GPU resources, leaving the parallelism provided by GPUs underutilized. Learn more about Triton's [Batcher and Scheduler](user_guide/model_configuration.md#scheduling-and-batching).
+- [Default Scheduler - Non-Batching](user_guide/model_configuration.md#default-scheduler) +- [Dynamic Batcher](user_guide/model_configuration.md#dynamic-batcher) + - [How to Configure Dynamic Batcher](user_guide/model_configuration.md#recommended-configuration-process) + - [Delayed Batching](user_guide/model_configuration.md#delayed-batching) + - [Preferred Batch Size](user_guide/model_configuration.md#preferred-batch-sizes) + - [Preserving Request Ordering](user_guide/model_configuration.md#preserve-ordering) + - [Priority Levels](user_guide/model_configuration.md#priority-levels) + - [Queuing Policies](user_guide/model_configuration.md#queue-policy) + - [Ragged Batching](user_guide/ragged_batching.md) +- [Sequence Batcher](user_guide/model_configuration.md#sequence-batcher) + - [Stateful Models](user_guide/architecture.md#stateful-models) + - [Control Inputs](user_guide/architecture.md#control-inputs) + - [Implicit State - Stateful Inference Using a Stateless Model](user_guide/architecture.md#implicit-state-management) + - [Sequence Scheduling Strategies](user_guide/architecture.md#scheduling-strategies) + - [Direct](user_guide/architecture.md#direct) + - [Oldest](user_guide/architecture.md#oldest) + +#### Rate Limiter +The rate limiter manages the rate at which requests are scheduled on model instances by Triton. The rate limiter operates across all models loaded in Triton to allow cross-model prioritization. [Learn more](user_guide/rate_limiter.md). + +#### Model Warmup +For a few of the Backends (check [Additional Resources](README.md#resources)), some or all of the initialization is deferred until the first inference request is received. The benefit is resource conservation, but it comes with the downside that the initial requests are processed more slowly than expected. Users can pre-"warm up" the model by instructing Triton to initialize the model. [Learn more](user_guide/model_configuration.md#model-warmup). + +#### Inference Request/Response Cache +Triton has a feature which allows inference responses to be cached. [Learn More](user_guide/response_cache.md). + +### Model Pipeline +Building ensembles is as easy as adding an additional configuration file which outlines the specific flow of tensors from one model to another. Any additional changes required by the model ensemble can be made in existing (individual) model configurations. +- [Model Ensemble](user_guide/architecture.md#ensemble-models) +- [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting) +### Model Management +Users can specify policies in the model configuration for loading and unloading of models. This [section](user_guide/model_management.md) covers user-selectable policy details. +- [Explicit Model Loading and Unloading](user_guide/model_management.md#model-control-mode-explicit) +- [Modifying the Model Repository](user_guide/model_management.md#modifying-the-model-repository) +### Metrics +Triton provides Prometheus metrics like GPU Utilization, Memory Usage, Latency and more. Learn about [available metrics](user_guide/metrics.md). +### Framework Custom Operations +Some frameworks provide the option of building custom layers/operations. These can be added to specific Triton Backends for those frameworks.
[Learn more](user_guide/custom_operations.md) +- [TensorRT](user_guide/custom_operations.md#tensorrt) +- [TensorFlow](user_guide/custom_operations.md#tensorflow) +- [PyTorch](user_guide/custom_operations.md#pytorch) +- [ONNX](user_guide/custom_operations.md#onnx) +### Client Libraries and Examples +Use the [Triton Client](https://github.com/triton-inference-server/client) API to integrate client applications over the network HTTP/gRPC API or integrate applications directly with Triton using CUDA shared memory to remove network overhead. +- [C++ HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) +- [Python HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis) +- [Java HTTP Library](https://github.com/triton-inference-server/client/tree/main/src/java) +- GRPC Generated Libraries + - [go](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/go) + - [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java) + - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) +- [Shared Memory Extension](protocol/extension_shared_memory.md) +### Cancelling Inference Requests +Triton can detect and handle requests that have been cancelled from the client side. This [document](user_guide/request_cancellation.md) discusses the scope and limitations of the feature. +### Performance Analysis +Understanding inference performance is key to better resource utilization. Use Triton's tools to customize your deployment. +- [Performance Tuning Guide](user_guide/performance_tuning.md) +- [Optimization](user_guide/optimization.md) +- [Model Analyzer](user_guide/model_analyzer.md) +- [Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) +- [Inference Request Tracing](user_guide/trace.md) +### Jetson and JetPack +Triton can be deployed on edge devices. Explore [resources](user_guide/jetson.md) and [examples](examples/jetson/README.md). + +## **Resources** + +The following resources are recommended to explore the full suite of Triton Inference Server's functionalities. +- **Clients**: Triton Inference Server comes with C++, Python and Java APIs with which users can send HTTP/REST or gRPC (with possible extensions for other languages) requests. Explore the [client repository](https://github.com/triton-inference-server/server/tree/main/docs/protocol) for examples and documentation. + +- **Configuring Deployment**: Triton comes with three tools which can be used to configure deployment settings, measure performance and recommend optimizations. + - [Model Analyzer](https://github.com/triton-inference-server/model_analyzer): Model Analyzer is a CLI tool built to recommend deployment configurations for Triton Inference Server based on the user's Quality of Service requirements. It also generates detailed reports about model performance to summarize the benefits and trade-offs of different configurations. + - [Perf Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md): + Perf Analyzer is a CLI application built to generate inference requests and + measure the latency of those requests and the throughput of the model being + served.
+ - [Model Navigator](https://github.com/triton-inference-server/model_navigator): + The Triton Model Navigator is a tool that provides the ability to automate the process of moving a model from source to the optimal format and configuration for deployment on Triton Inference Server. The tool supports exporting a model from source to all possible formats and applies the Triton Inference Server backend optimizations. + +- **Backends**: Triton supports a wide variety of frameworks used to run models. Users can extend this functionality by creating custom backends. + - [PyTorch](https://github.com/triton-inference-server/pytorch_backend): Widely used Open Source DL Framework + - [TensorFlow](https://github.com/triton-inference-server/tensorflow_backend): Widely used Open Source DL Framework + - [TensorRT](https://github.com/triton-inference-server/tensorrt_backend): NVIDIA [TensorRT](https://developer.nvidia.com/tensorrt) is an inference acceleration SDK that provides a wide range of graph optimizations, kernel optimizations, use of lower precision, and more. + - [ONNX](https://github.com/triton-inference-server/onnxruntime_backend): ONNX Runtime is a cross-platform inference and training machine-learning accelerator. + - [OpenVINO](https://github.com/triton-inference-server/openvino_backend): OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. + - [Paddle Paddle](https://github.com/triton-inference-server/paddlepaddle_backend): Widely used Open Source DL Framework + - [Python](https://github.com/triton-inference-server/python_backend): Users can add custom business logic, or any Python code/model for serving requests. + - [Forest Inference Library](https://github.com/triton-inference-server/fil_backend): Backend built for forest models trained by several popular machine learning frameworks (including XGBoost, LightGBM, Scikit-Learn, and cuML) + - [DALI](https://github.com/triton-inference-server/dali_backend): NVIDIA [DALI](https://developer.nvidia.com/dali) is a Data Loading Library purpose-built to accelerate the pre-processing and data loading steps in a Deep Learning Pipeline. + - [HugeCTR](https://github.com/triton-inference-server/hugectr_backend): HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates + - [Managed Stateful Models](https://github.com/triton-inference-server/stateful_backend): This backend automatically manages the input and output states of a model. The states are associated with a sequence id and need to be tracked for inference requests associated with the sequence id. + - [Faster Transformer](https://github.com/triton-inference-server/fastertransformer_backend): NVIDIA [FasterTransformer](https://github.com/NVIDIA/FasterTransformer/) (FT) is a library implementing an accelerated engine for the inference of transformer-based neural networks, with a special emphasis on large models, spanning many GPUs and nodes in a distributed manner. + - [Building Custom Backends](https://github.com/triton-inference-server/backend/tree/main/examples#tutorial) + - [Sample Custom Backend: Repeat_backend](https://github.com/triton-inference-server/repeat_backend): Backend built to demonstrate sending of zero, one, or multiple responses per request. + +## **Customization Guide** +This guide describes how to build and test Triton and also how Triton can be extended with new functionality.
+ +- [Build](customization_guide/build.md) +- [Protocols and APIs](customization_guide/inference_protocols.md). - [Backends](https://github.com/triton-inference-server/backend) -- [Repository Agents](repository_agents.md) -- [Test](test.md) +- [Repository Agents](customization_guide/repository_agents.md) +- [Test](customization_guide/test.md) diff --git a/docs/_static/.gitattributes b/docs/_static/.gitattributes new file mode 100644 index 0000000000..04865f126a --- /dev/null +++ b/docs/_static/.gitattributes @@ -0,0 +1,2 @@ +nvidia-logo-horiz-rgb-blk-for-screen.png filter=lfs diff=lfs merge=lfs -text +nvidia-logo-vert-rgb-blk-for-screen.png filter=lfs diff=lfs merge=lfs -text diff --git a/docs/_static/custom.css b/docs/_static/custom.css new file mode 100644 index 0000000000..46bab57d4e --- /dev/null +++ b/docs/_static/custom.css @@ -0,0 +1,319 @@ +/* +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+*/ +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/5/2/52891dda673228d54e5d57bf1e4a3880d4b22405.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/e/0/e090b7dda7a582522c7f9045c6ce949cce60134f.woff) format("woff"); + font-weight: 300; + font-style: normal; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/a/1/a107baabcbf6b241099122336bce7429bcfd377a.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/3/a/3a6060a4e3bce70e5552ba0de8af4b22c6cf9144.woff) format("woff"); + font-weight: 300; + font-style: italic; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/9/9/9920d2b172b01d92fc9c1c0e521dcf45b59c47c3.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/6/c/6c7d947928a7e4ef3e80ed409bef6c243f2148cb.woff) format("woff"); + font-weight: 400; + font-style: normal; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/e/8/e8e63fe1244372cd942d957f44a5616a1eba0644.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/0/f/0f1fb2af0283ab09d36e7097bb07d895c3228f12.woff) format("woff"); + font-weight: 400; + font-style: italic; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/7/9/79d3c513a9cd72c59f65354f39f89ca52dc17dd2.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/2/5/2581ac533f5d01f4985d8a7245b0766b4630ced8.woff) format("woff"); + font-weight: 500; + font-style: normal; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/3/9/39d9ef1ee9770dd503f19bb2ace2fdb4eff3bb50.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/7/b/7bb5d5e2e71b2e13c8098b2e67c0a0ed9258e6c7.woff) format("woff"); + font-weight: 500; + font-style: italic; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/0/5/05276a55a43eb3f74981ec1e93252727afcd9d16.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/9/c/9cfec7ed941b06564aa4d5ca14610e81542d070f.woff) format("woff"); + font-weight: 700; + font-style: normal; +} +@font-face { + font-family: "NVIDIA Sans"; + src: url(https://aws1.discourse-cdn.com/nvidia/original/3X/a/e/aebd14d09ba56f541e1b8735fb051e33710f9ae7.woff2) format("woff2"), + url(https://aws1.discourse-cdn.com/nvidia/original/3X/e/d/edbdabef43acc5c12e84a94baaa5542c9404cfeb.woff) format("woff"); + font-weight: 700; + font-style: italic; +} + +/* Custom Styles */ +:root { +--pst-font-size-base: none; +--pst-color-primary: 0, 133, 197; +--pst-color-admonition-note: var(--pst-color-primary); +--pst-color-admonition-default: var(--pst-color-primary); +--pst-color-info: 255, 193, 7; +--pst-color-admonition-tip: var(--pst-color-info); +--pst-color-admonition-hint: var(--pst-color-info); +--pst-color-admonition-important: var(--pst-color-info); +--pst-color-warning: 245, 162, 82; +--pst-color-danger: 230, 101, 129; +--pst-color-admonition-warning: var(--pst-color-danger); +--pst-color-link: 118, 185, 0; +--pst-color-inline-code: 92, 22, 130; +--font-family-sans-serif: NVIDIA Sans, Helvetica, Arial, Sans-serif; +--pst-font-family-base-system: NVIDIA Sans, Helvetica, Arial, Sans-serif; +font-family: NVIDIA Sans, Helvetica, Arial, Sans-serif; +} + 
+.prev-next-area { + font-size: small; +} + +.docutils caption { + caption-side: top; +} + +#site-navigation h1.site-logo { + font-size: 0.85em; +} + +/* colors +nv green 118,185,0 +black 0, 0, 0 +light gray 205, 205, 205 +medium gray 140, 140, 140 +dark gray 94, 94, 94 + +emerald 0, 133, 100 +emerald #008564 +amethyst 92, 22, 130 +amethyst #5C1682 +cpu blue 0, 133, 197 +cpu blue #0085C5 +garnet 137, 12, 88 +garnet 890C58 +fluorite 250, 194, 0 +fluorite FAC200 +*/ + +:root { + --nv-green: #76b900; + --nv-green-darken: #6ead00; + --emerald: #008564; + --emerald-darken: #017c5d; + --amethyst: #5d1682; + --amethyst-darken: #4c116b; + --cpu-blue: #0071c5; + --cpu-blue-darken: #0062ad; + --garnet: #890c58; + --garnet-darken: #7a0c4e; + --fluorite: #fac200; + --fluorite-darken: #e4b301; + --dark-gray: #5e5e5e; + --light-gray: #cdcdcd; + --medium-gray: #8c8c8c; + --medium-gray-darken: #8c8c8cde; + --primary: #76b900; + --secondary: #008564; + --success: #5d1682; + --info: #0071c5; + --warning: #fac200; + --danger: #890c58; +} + +/* Riva TBYB (ASR and TTS) Styling */ +.demo-box { + background-color: rgb(245,245,245); +} +a:link { text-decoration: none; } +.scrollable { + height: 125px; + overflow-y: auto; + font-size: 1.3rem; +} +.dot { + height: 8px; + width: 8px; + background-color: rgb(228, 77, 77); + border-radius: 50%; + display: inline-block; +} +.timer { + font-size: 80%; + text-transform: uppercase; + white-space: nowrap; +} +.form-select { + border-radius: 0%; + font-size: 80%; +} +.form-control { + border-radius: 0%; +} +.input-group-text { + border-radius: 0%; + font-size: 80%; + text-transform: uppercase; + background-color: rgb(245,245,245); +} +.card { + border-radius: 0%; +} +.speech-control { + border-top-width: 0px; +} +.btn { + border-radius: 0%; + font-size: 80%; + text-transform: uppercase; + white-space: nowrap; + min-width: 125px; +} +.btn-primary { + background-color: var(--nv-green); + border-color: var(--nv-green); +} +.btn-primary:hover { + background-color: var(--nv-green-darken); + border-color: var(--nv-green-darken); +} +.btn-primary:focus, .btn-primary.focus { + background-color: var(--nv-green-darken); + border-color: var(--nv-green-darken); + -webkit-box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); + box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); +} +.btn-primary.disabled, .btn-primary:disabled { + background-color: var(--nv-green); + border-color: var(--nv-green); +} +.btn-primary:not(:disabled):not(.disabled):active, .btn-primary:not(:disabled):not(.disabled).active, +.show > .btn-primary.dropdown-toggle { + background-color: var(--nv-green-darken); + border-color: var(--nv-green-darken); +} +.btn-primary:not(:disabled):not(.disabled):active:focus, .btn-primary:not(:disabled):not(.disabled).active:focus, +.show > .btn-primary.dropdown-toggle:focus { + -webkit-box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); + box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); +} +.btn-secondary { + background-color: var(--medium-gray); + border-color: var(--medium-gray); +} +.btn-secondary:hover { + background-color: var(--medium-gray-darken); + border-color: var(--medium-gray-darken); +} +.btn-secondary:focus, .btn-secondary.focus { + background-color: var(--medium-gray-darken); + border-color: var(--medium-gray-darken); + -webkit-box-shadow: 0 0 0 0.2rem rgba(140, 140, 140, 0.5); + box-shadow: 0 0 0 0.2rem rgba(140, 140, 140, 0.5); +} +.btn-secondary.disabled, .btn-secondary:disabled { + background-color: var(--medium-gray); + border-color: var(--medium-gray); +} 
+.btn-secondary:not(:disabled):not(.disabled):active, .btn-secondary:not(:disabled):not(.disabled).active, +.show > .btn-secondary.dropdown-toggle { + background-color: var(--medium-gray-darken); + border-color: var(--medium-gray-darken); +} +.btn-secondary:not(:disabled):not(.disabled):active:focus, .btn-secondary:not(:disabled):not(.disabled).active:focus, +.show > .btn-secondary.dropdown-toggle:focus { + -webkit-box-shadow: 0 0 0 0.2rem rgba(140, 140, 140, 0.5); + box-shadow: 0 0 0 0.2rem rgba(140, 140, 140, 0.5); +} +.btn-link { + color: var(--nv-green); + text-decoration-line: none; +} +.btn-link:hover { + color: var(--nv-green-darken); +} +.btn-link:focus, .btn-link.focus { + color: var(--nv-green-darken); + -webkit-box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); + box-shadow: 0 0 0 0.2rem rgba(147, 173, 102, 0.5); +} +.link-primary { + color: var(--nv-green); +} +.link-primary:hover { + color: var(--nv-green-darken); +} + +/* Riva ASR Styles */ +#riva-upload-label { + margin-top: 0.5rem; +} + +/* Riva TTS Styles */ +.tts-control { + justify-content: space-between; + align-items: center; +} + +.tts-control > p { + margin: unset; +} + +#riva-tts-field { + resize: none; + border: unset; + padding: 0; + height: 100%; + font-size: 1.0rem; +} + +#riva-terms-of-use p { + max-width: 620px; +} + +/* Media Queries */ +@media (max-width: 1024px) { + + /* Riva TTS and ASR */ + .scrollable { + height: 250px; + } +} + diff --git a/docs/_static/logo_2color_horizontal.svg b/docs/_static/logo_2color_horizontal.svg new file mode 100644 index 0000000000..5ab0442d32 --- /dev/null +++ b/docs/_static/logo_2color_horizontal.svg @@ -0,0 +1,2 @@ + + diff --git a/docs/_static/logo_2color_vertical.svg b/docs/_static/logo_2color_vertical.svg new file mode 100644 index 0000000000..69e64b7001 --- /dev/null +++ b/docs/_static/logo_2color_vertical.svg @@ -0,0 +1,2 @@ + + diff --git a/docs/_static/nvidia-logo-horiz-rgb-blk-for-screen.png b/docs/_static/nvidia-logo-horiz-rgb-blk-for-screen.png new file mode 100644 index 0000000000..6316a9340f --- /dev/null +++ b/docs/_static/nvidia-logo-horiz-rgb-blk-for-screen.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:dd57ffce985e08c97c6af5fdadd2a28e4a92996455edc2d0598dd964cca51eae +size 48928 diff --git a/docs/_static/nvidia-logo-vert-rgb-blk-for-screen.png b/docs/_static/nvidia-logo-vert-rgb-blk-for-screen.png new file mode 100644 index 0000000000..5546c1b57d --- /dev/null +++ b/docs/_static/nvidia-logo-vert-rgb-blk-for-screen.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:17a25111e145aa52b77ec5a89eb3b0c7d9a2a90dea25a0bb867a937514fc783c +size 63541 diff --git a/docs/_static/rtd-data.js b/docs/_static/rtd-data.js new file mode 100644 index 0000000000..7ed13e8ee0 --- /dev/null +++ b/docs/_static/rtd-data.js @@ -0,0 +1,36 @@ +/* +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +// Dummy data for testing ReadTheDocs footer insertion +// This mimics RTD data for a project that uses both versions + languages +var READTHEDOCS_DATA = { + project: "frc-docs", + version: "latest", + language: "en", + proxied_api_host: "https://readthedocs.org", +}; diff --git a/docs/_templates/layout.html b/docs/_templates/layout.html new file mode 100644 index 0000000000..570aba8ba3 --- /dev/null +++ b/docs/_templates/layout.html @@ -0,0 +1,31 @@ + +{% extends "!layout.html" %} +{%- block footer %} + +{%- endblock %} diff --git a/docs/conf.py b/docs/conf.py new file mode 100755 index 0000000000..9378329752 --- /dev/null +++ b/docs/conf.py @@ -0,0 +1,256 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. 
For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +import os + +from docutils import nodes +from sphinx import search + +# import sys +# sys.path.insert(0, os.path.abspath('.')) + +# -- Project information ----------------------------------------------------- + +project = "NVIDIA Triton Inference Server" +copyright = "2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved" +author = "NVIDIA" + +# The full version, including alpha/beta/rc tags +# Env only set during riva-release process, otherwise keep as dev for all internal builds +release = os.getenv("TRITON_VERSION", "dev") + +# maintain left-side bar toctrees in `contents` file +# so it doesn't show up needlessly in the index page +master_doc = "contents" + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ + "ablog", + "myst_nb", + "sphinx_copybutton", + "sphinx_design", + "sphinx-prompt", + # "sphinxcontrib.bibtex", + "sphinx_tabs.tabs", + "sphinx_sitemap", +] + +suppress_warnings = ["myst.domains", "ref.ref"] + +numfig = True + +# final location of docs for seo/sitemap +html_baseurl = ( + "https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/" +) + +myst_enable_extensions = [ + "dollarmath", + "amsmath", + "deflist", + # "html_admonition", + # "html_image", + "colon_fence", + # "smartquotes", + "replacements", + # "linkify", + "substitution", +] +myst_heading_anchors = 5 + +# Add any paths that contain templates here, relative to this directory. +templates_path = ["_templates"] + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. +exclude_patterns = ["README.md"] + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = "sphinx_book_theme" +html_logo = "_static/nvidia-logo-horiz-rgb-blk-for-screen.png" +html_title = "NVIDIA Triton Inference Server" +html_short_title = "Triton" +html_copy_source = True +html_sourcelink_suffix = "" +html_favicon = "_static/nvidia-logo-vert-rgb-blk-for-screen.png" +html_last_updated_fmt = "" +html_additional_files = ["index.html"] + +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". 
+html_static_path = ["_static"] +html_css_files = ["custom.css"] + +html_theme_options = { + "path_to_docs": "docs", + # "launch_buttons": { + # "binderhub_url": "https://mybinder.org", + # "colab_url": "https://colab.research.google.com/", + # "deepnote_url": "https://deepnote.com/", + # "notebook_interface": "jupyterlab", + # "thebe": True, + # # "jupyterhub_url": "https://datahub.berkeley.edu", # For testing + # }, + "use_edit_page_button": False, + "use_issues_button": True, + "use_repository_button": True, + "use_download_button": False, + "logo_only": False, + "show_toc_level": 2, + "extra_navbar": "", + "extra_footer": "", + "repository_url": "https://github.com/triton-inference-server/server", + "use_repository_button": True, +} + +version_short = release +deploy_ngc_org = "nvidia" +deploy_ngc_team = "triton" +myst_substitutions = { + "VersionNum": version_short, + "deploy_ngc_org_team": f"{deploy_ngc_org}/{deploy_ngc_team}" + if deploy_ngc_team + else deploy_ngc_org, +} + + +def ultimateReplace(app, docname, source): + result = source[0] + for key in app.config.ultimate_replacements: + result = result.replace(key, app.config.ultimate_replacements[key]) + source[0] = result + + +# this is a necessary hack to allow us to fill in variables that exist in code blocks +ultimate_replacements = { + "{VersionNum}": version_short, + "{SamplesVersionNum}": version_short, + "{NgcOrgTeam}": f"{deploy_ngc_org}/{deploy_ngc_team}" + if deploy_ngc_team + else deploy_ngc_org, +} + +# bibtex_bibfiles = ["references.bib"] +# To test that style looks good with common bibtex config +# bibtex_reference_style = "author_year" +# bibtex_default_style = "plain" + +### We currently use Myst: https://myst-nb.readthedocs.io/en/latest/use/execute.html +jupyter_execute_notebooks = "off" # Global execution disable +# execution_excludepatterns = ['tutorials/tts-python-basics.ipynb'] # Individual notebook disable + + +def setup(app): + app.add_config_value("ultimate_replacements", {}, True) + app.connect("source-read", ultimateReplace) + app.add_js_file("https://js.hcaptcha.com/1/api.js") + + visitor_script = ( + "//assets.adobedtm.com/5d4962a43b79/c1061d2c5e7b/launch-191c2462b890.min.js" + ) + + if visitor_script: + app.add_js_file(visitor_script) + + # if not os.environ.get("READTHEDOCS") and not os.environ.get("GITHUB_ACTIONS"): + # app.add_css_file( + # "https://assets.readthedocs.org/static/css/readthedocs-doc-embed.css" + # ) + # app.add_css_file("https://assets.readthedocs.org/static/css/badge_only.css") + + # # Create the dummy data file so we can link it + # # ref: https://github.com/readthedocs/readthedocs.org/blob/bc3e147770e5740314a8e8c33fec5d111c850498/readthedocs/core/static-src/core/js/doc-embed/footer.js # noqa: E501 + # app.add_js_file("rtd-data.js") + # app.add_js_file( + # "https://assets.readthedocs.org/static/javascript/readthedocs-doc-embed.js", + # priority=501, + # ) + + +# Patch for sphinx.search stemming short terms (i.e. 
tts -> tt) +# https://github.com/sphinx-doc/sphinx/blob/4.5.x/sphinx/search/__init__.py#L380 +def sphinxSearchIndexFeed( + self, docname: str, filename: str, title: str, doctree: nodes.document +): + """Feed a doctree to the index.""" + self._titles[docname] = title + self._filenames[docname] = filename + + visitor = search.WordCollector(doctree, self.lang) + doctree.walk(visitor) + + # memoize self.lang.stem + def stem(word: str) -> str: + try: + return self._stem_cache[word] + except KeyError: + self._stem_cache[word] = self.lang.stem(word).lower() + return self._stem_cache[word] + + _filter = self.lang.word_filter + + for word in visitor.found_title_words: + stemmed_word = stem(word) + if len(stemmed_word) > 3 and _filter(stemmed_word): + self._title_mapping.setdefault(stemmed_word, set()).add(docname) + elif _filter(word): # stemmer must not remove words from search index + self._title_mapping.setdefault(word.lower(), set()).add(docname) + + for word in visitor.found_words: + stemmed_word = stem(word) + # again, stemmer must not remove words from search index + if len(stemmed_word) <= 3 or not _filter(stemmed_word) and _filter(word): + stemmed_word = word.lower() + already_indexed = docname in self._title_mapping.get(stemmed_word, set()) + if _filter(stemmed_word) and not already_indexed: + self._mapping.setdefault(stemmed_word, set()).add(docname) + + +search.IndexBuilder.feed = sphinxSearchIndexFeed diff --git a/docs/contents.md b/docs/contents.md new file mode 100644 index 0000000000..ca952fed2c --- /dev/null +++ b/docs/contents.md @@ -0,0 +1,104 @@ + + +```{toctree} +:maxdepth: 1 +:caption: Getting Started + +getting_started/quickstart +``` + +```{toctree} +:maxdepth: 1 +:caption: User Guide + +user_guide/performance_tuning +user_guide/architecture +user_guide/model_repository +customization_guide/repository_agents +user_guide/model_configuration +user_guide/request_cancellation +user_guide/optimization +user_guide/ragged_batching +user_guide/rate_limiter +user_guide/model_analyzer +user_guide/perf_analyzer +user_guide/model_management +user_guide/custom_operations +user_guide/decoupled_models +user_guide/response_cache +user_guide/metrics +user_guide/trace +user_guide/jetson +user_guide/v1_to_v2 +customization_guide/deploy +``` + +```{toctree} +:maxdepth: 1 +:caption: Debugging + +user_guide/debugging_guide +user_guide/faq +``` + +```{toctree} +:maxdepth: 1 +:caption: Protocol Guides + +protocol/README.md +customization_guide/inference_protocols +protocol/extension_binary_data +protocol/extension_classification +protocol/extension_generate +protocol/extension_logging +protocol/extension_model_configuration +protocol/extension_model_repository +protocol/extension_schedule_policy +protocol/extension_sequence +protocol/extension_shared_memory +protocol/extension_statistics +protocol/extension_trace +``` + +```{toctree} +:maxdepth: 1 +:caption: Customization Guide + +customization_guide/build +customization_guide/compose +customization_guide/test +``` + +```{toctree} +:maxdepth: 1 +:caption: Examples + +examples/jetson/README +examples/jetson/concurrency_and_dynamic_batching/README +``` diff --git a/docs/build.md b/docs/customization_guide/build.md similarity index 90% rename from docs/build.md rename to docs/customization_guide/build.md index d64cceb4cc..40f8f00c76 100644 --- a/docs/build.md +++ b/docs/customization_guide/build.md @@ -1,5 +1,5 @@ + +# Secure Deployment Considerations + +The Triton Inference Server project is designed for flexibility and +allows developers to create 
and deploy inferencing solutions in a +variety of ways. Developers can deploy Triton as an http server, a +grpc server, a server supporting both, or embed a Triton server into +their own application. Developers can deploy Triton locally or in the +cloud, within a Kubernetes cluster behind an API gateway or as a +standalone process. This guide is intended to provide some key points +and best practices that users deploying Triton based solutions should +consider. + +| [Deploying Behind a Secure Gateway or Proxy](#deploying-behind-a-secure-proxy-or-gateway) | [Running with Least Privilege](#running-with-least-privilege) | + +> [!IMPORTANT] +> Ultimately the security of a solution based on Triton +> is the responsibility of the developer building and deploying that +> solution. When deploying in production settings please have security +> experts review any potential risks and threats. + +> [!WARNING] +> Dynamic updates to model repositories are disabled by +> default. Enabling dynamic updates to model repositories either +> through model loading APIs or through directory polling can lead to +> arbitrary code execution. Model repository access control is +> critical in production deployments. If dynamic updates are required, +> ensure only trusted entities have access to model loading APIs and +> model repository directories. + +## Deploying Behind a Secure Proxy or Gateway + +The Triton Inference Server is designed primarily as a microservice to +be deployed as part of a larger solution within an application +framework or service mesh. + +In such deployments it is typical to utilize dedicated gateway or +proxy servers to handle authorization, access control, resource +management, encryption, load balancing, redundancy and many other +security and availability features. + +The full design of such systems is outside the scope of this +deployment guide but in such scenarios dedicated ingress controllers +handle access from outside the trusted network while Triton Inference +Server handles only trusted, validated requests. + +In such scenarios Triton Inference Server is not exposed directly to +an untrusted network. + +### References on Secure Deployments + +In the following references, Triton Inference Server would be deployed +as an "Application" or "Service" within the trusted internal network. + +* [https://www.nginx.com/blog/architecting-zero-trust-security-for-kubernetes-apps-with-nginx/] +* [https://istio.io/latest/docs/concepts/security/] +* [https://konghq.com/blog/enterprise/envoy-service-mesh] +* [https://www.solo.io/topics/envoy-proxy/] + +## Running with Least Privilege + + The security principle of least privilege advocates that a process be + granted the minimum permissions required to do its job. + + For an inference solution based on Triton Inference Server there are a + number of ways to reduce security risks by limiting the permissions + and capabilities of the server to the minimum required for correct + operation. + +### 1. Follow Best Practices for Securing Kubernetes Deployments + + When deploying Triton within a Kubernetes pod ensure that it is + running with a service account with the fewest possible + permissions. Ensure that you have configured [role based access + control](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) + to limit access to resources and capabilities as required by your + application. + +### 2. 
Follow Best Practices for Launching Standalone Docker Containers + + When Triton is deployed as a containerized service, standard docker + security practices apply. This includes limiting the resources that a + container has access to as well as limiting network access to the + container. https://docs.docker.com/engine/security/ + +### 3. Run as a Non-Root User + + Triton's pre-built containers contain a non-root user that can be used + to launch the tritonserver application with limited permissions. This + user, `triton-server`, is created with `user id 1000`. When launching + the container using docker the user can be set with the `--user` + command line option. + +##### Example Launch Command + + ``` + docker run --rm --user triton-server -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:YY.MM-py3 tritonserver --model-repository=/models + ``` + +### 4. Restrict or Disable Access to Protocols and APIs + +The pre-built Triton Inference Server application enables a full set +of features including health checks, server metadata, inference APIs, +shared memory APIs, model and model repository configuration, +statistics, tracing and logging. Care should be taken to only expose +those capabilities that are required for your solution. + +#### Disabling Features at Compile Time + +When building a custom inference server application, features can be +selectively enabled or disabled using the `build.py` script. As an +example, a developer can use the flags `--endpoint http` and +`--endpoint grpc` to compile support for `http`, `grpc` or +both. Support for individual backends can be enabled as well. For more +details please see [documentation](build.md) on building a custom +inference server application. + +#### Disabling / Restricting Features at Run Time + +The `tritonserver` application provides a number of command line +options to enable and disable features when launched. For a full list +of options please see `tritonserver --help`. The following subset is +described here with basic recommendations. + +##### `--exit-on-error , default True` + +Exits the inference server if any error occurs during +initialization. Recommended to set to `True` to catch any +unanticipated errors. + +##### `--disable-auto-complete-config, default enabled` + +Disables backends from autocompleting model configuration. If not +required for your solution, it is recommended to disable auto-completion to ensure model +configurations are defined statically. + +##### `--strict-readiness , default True` + +If set to true, `/v2/health/ready` will only report ready when all +selected models are loaded. Recommended to set to `True` to provide a +signal to other services and orchestration frameworks when full +initialization is complete and the server is healthy. + +##### `--model-control-mode , default "none"` + +Specifies the mode for model management. + +> [!WARNING] +> Allowing dynamic updates to the model repository can lead +> to arbitrary code execution. Model repository access control is +> critical in production deployments. Unless required for operation, it's recommended +> to disable dynamic updates. If required, please ensure only trusted entities +> can add or remove models from a model repository. + +Options: + + * `none`- Models are loaded at start up and cannot be modified. + * `poll`- Server process will poll the model repository for changes. + * `explicit` - Models can be loaded and unloaded via the model control APIs. + +Recommended to set to `none` unless dynamic updates are required.
If +dynamic updates are required, care must be taken to control access to +the model repository files and load and unload APIs. + +##### `--allow-http , default True` + +Enable HTTP request handling. Recommended to set to `False` if not required. + +##### `--allow-grpc , default True` + +Enable gRPC request handling. Recommended to set to `False` if not required. + +##### `--grpc-use-ssl default False` + +Use SSL authentication for gRPC requests. Recommended to set to `True` if service is not protected by a gateway or proxy. + +##### `--grpc-use-ssl-mutual default False` + +Use mutual SSL authentication for gRPC requests. Recommended to set to `True` if service is not protected by a gateway or proxy. + +##### `--grpc-restricted-protocol <:=>` + +Restrict access to specific gRPC protocol categories to users with +a specific key, value pair shared secret. See +[limit-endpoint-access](inference_protocols.md#limit-endpoint-access-beta) +for more information. + +> [!Note] +> Restricting access can be used to limit exposure to model +> control APIs to trusted users. + +##### `--http-restricted-api <:=>` + +Restrict access to specific HTTP API categories to users with +a specific key, value pair shared secret. See +[limit-endpoint-access](inference_protocols.md#limit-endpoint-access-beta) +for more information. + +> [!Note] +> Restricting access can be used to limit exposure to model +> control APIs to trusted users. + +##### `--allow-sagemaker default False` + +Enable SageMaker request handling. Recommended to set to `False` unless required. + +##### `--allow-vertex-ai default depends on environment variable` + +Enable Vertex AI request handling. Default is `True` if +`AIP_MODE=PREDICTION`, `False` otherwise. Recommended to set to +`False` unless required. + +##### `--allow-metrics default True` + +Allow the server to publish Prometheus-style metrics. Recommended to set +to `False` if not required to avoid capturing or exposing any sensitive information. + +#### `--trace-config level= default "off"` + +Tracing mode. Trace mode supports `triton` and `opentelemetry`. Unless required, `--trace-config level=off` should be set to avoid capturing or exposing any sensitive information. + + +##### `backend-directory default /opt/tritonserver/backends` + +Directory where backend shared libraries are found. + +> [!Warning] +> Access to add or remove files from the backend directory +> must be access controlled. Adding untrusted files +> can lead to arbitrary code execution. + +##### `repoagent-directory default /opt/tritonserver/repoagents` +Directory where repository agent shared libraries are found. + +> [!Warning] +> Access to add or remove files from the repoagent directory +> must be access controlled. Adding untrusted files +> can lead to arbitrary code execution. + +##### `cache-directory default /opt/tritonserver/caches` + +Directory where cache shared libraries are found. + +> [!Warning] +> Access to add or remove files from the cache directory +> must be access controlled. Adding untrusted files +> can lead to arbitrary code execution.
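+Putting several of these recommendations together, the following is a minimal
+sketch of a locked-down launch command for an HTTP-only deployment. The exact
+flag spellings and defaults should be confirmed against `tritonserver --help`
+for the version being deployed:
+
+```
+tritonserver --model-repository=/models \
+             --model-control-mode=none \
+             --exit-on-error=true \
+             --strict-readiness=true \
+             --allow-grpc=false \
+             --allow-sagemaker=false \
+             --allow-vertex-ai=false \
+             --allow-metrics=false \
+             --trace-config level=off
+```
+
+Combined with a read-only model repository and the non-root `triton-server`
+user described above, this limits the server to serving inference for the
+models loaded at startup.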
+ + + + + diff --git a/docs/inference_protocols.md b/docs/customization_guide/inference_protocols.md similarity index 65% rename from docs/inference_protocols.md rename to docs/customization_guide/inference_protocols.md index 350fb78b41..592f26e7d1 100644 --- a/docs/inference_protocols.md +++ b/docs/customization_guide/inference_protocols.md @@ -1,5 +1,5 @@ + +# Triton Examples + +**New to Triton Inference Server?** Make use of [these tutorials](https://github.com/triton-inference-server/tutorials) to begin your Triton journey! + +This folder contains the following: +* jetson: This covers deploying Triton Inference Server on Jetson devices. +* model_repository: This folder is a basic model repository for deploying models using the Triton Inference Server. \ No newline at end of file diff --git a/docs/examples/jetson/README.md b/docs/examples/jetson/README.md index fcd28e6c59..f149acbca4 100644 --- a/docs/examples/jetson/README.md +++ b/docs/examples/jetson/README.md @@ -1,5 +1,5 @@ + +::::{grid} +:reverse: +:gutter: 2 1 1 1 +:margin: 4 4 1 1 + +:::{grid-item} +:columns: 4 + +```{image} ./_static/nvidia-logo-vert-rgb-blk-for-screen.png +:width: 300px +``` +::: +:::{grid-item} +:columns: 8 +:class: sd-fs-3 + +NVIDIA Triton Inference Server + +::: +:::: + +Triton Inference Server is an open source inference serving software that streamlines AI inferencing. + + + +
+ +
+ +# Triton Inference Server + +Triton Inference Server enables teams to deploy any AI model from multiple deep +learning and machine learning frameworks, including TensorRT, TensorFlow, +PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference +across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM +CPU, or AWS Inferentia. Triton Inference Server delivers optimized performance +for many query types, including real time, batched, ensembles and audio/video +streaming. Triton inference Server is part of +[NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/), +a software platform that accelerates the data science pipeline and streamlines +the development and deployment of production AI. + +Major features include: + +- [Supports multiple deep learning + frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton) +- [Supports multiple machine learning + frameworks](https://github.com/triton-inference-server/fil_backend) +- [Concurrent model + execution](user_guide/architecture.md#concurrent-model-execution) +- [Dynamic batching](user_guide/model_configuration.md#dynamic-batcher) +- [Sequence batching](user_guide/model_configuration.md#sequence-batcher) and + [implicit state management](user_guide/architecture.md#implicit-state-management) + for stateful models +- Provides [Backend API](https://github.com/triton-inference-server/backend) that + allows adding custom backends and pre/post processing operations +- Model pipelines using + [Ensembling](user_guide/architecture.md#ensemble-models) or [Business + Logic Scripting + (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting) +- [HTTP/REST and GRPC inference + protocols](customization_guide/inference_protocols.md) based on the community + developed [KServe + protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) +- A [C API](customization_guide/inference_protocols.md#in-process-triton-server-api) and + [Java API](customization_guide/inference_protocols.md#java-bindings-for-in-process-triton-server-api) + allow Triton to link directly into your application for edge and other in-process use cases +- [Metrics](user_guide/metrics.md) indicating GPU utilization, server + throughput, server latency, and more + +Join the [Triton and TensorRT community](https://www.nvidia.com/en-us/deep-learning-ai/triton-tensorrt-newsletter/) and stay current on the latest product updates, bug fixes, content, best +practices, and more. Need enterprise support? NVIDIA global support is available +for Triton Inference Server with the [NVIDIA AI Enterprise software suite](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/). + +See the [Latest Release Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-23-05.html#rel-23-05) for updates on the newest features and bug fixes. diff --git a/docs/metrics.md b/docs/metrics.md deleted file mode 100644 index 6f9a15f918..0000000000 --- a/docs/metrics.md +++ /dev/null @@ -1,143 +0,0 @@ - - -# Metrics - -Triton provides [Prometheus](https://prometheus.io/) metrics -indicating GPU and request statistics. By default, these metrics are -available at http://localhost:8002/metrics. The metrics are only -available by accessing the endpoint, and are not pushed or published -to any remote server. 
The metric format is plain text so you can view -them directly, for example: - -``` -$ curl localhost:8002/metrics -``` - -The tritonserver --allow-metrics=false option can be used to disable -all metric reporting and --allow-gpu-metrics=false can be used to -disable just the GPU Utilization and GPU Memory metrics. The ---metrics-port option can be used to select a different port. For now, -Triton reuses http address for metrics endpoint. The option --http-address -can be used to bind http and metrics endpoints to the same specific address -when http service is enabled. - -The following table describes the available metrics. - -|Category |Metric |Description |Granularity|Frequency | -|--------------|----------------|---------------------------------------|-----------|-------------| -|GPU Utilization |Power Usage |GPU instantaneous power |Per GPU |Per second | -| |Power Limit |Maximum GPU power limit |Per GPU |Per second | -| |Energy Consumption|GPU energy consumption in joules since Triton started|Per GPU|Per second| -| |GPU Utilization |GPU utilization rate (0.0 - 1.0) |Per GPU |Per second | -|GPU Memory |GPU Total Memory|Total GPU memory, in bytes |Per GPU |Per second | -| |GPU Used Memory |Used GPU memory, in bytes |Per GPU |Per second | -|Count |Success Count |Number of successful inference requests received by Triton (each request is counted as 1, even if the request contains a batch) |Per model |Per request | -| |Failure Count |Number of failed inference requests received by Triton (each request is counted as 1, even if the request contains a batch) |Per model |Per request | -| |Inference Count |Number of inferences performed (a batch of "n" is counted as "n" inferences, does not include cached requests)|Per model|Per request| -| |Execution Count |Number of inference batch executions (see [Count Metrics](#count-metrics), does not include cached requests)|Per model|Per request| -|Latency |Request Time |Cumulative end-to-end inference request handling time (includes cached requests) |Per model |Per request | -| |Queue Time |Cumulative time requests spend waiting in the scheduling queue (includes cached requests) |Per model |Per request | -| |Compute Input Time|Cumulative time requests spend processing inference inputs (in the framework backend, does not include cached requests) |Per model |Per request | -| |Compute Time |Cumulative time requests spend executing the inference model (in the framework backend, does not include cached requests) |Per model |Per request | -| |Compute Output Time|Cumulative time requests spend processing inference outputs (in the framework backend, does not include cached requests) |Per model |Per request | -|Response Cache|Total Cache Entry Count |Total number of responses stored in response cache across all models |Server-wide |Per second | -| |Total Cache Lookup Count |Total number of response cache lookups done by Triton across all models |Server-wide |Per second | -| |Total Cache Hit Count |Total number of response cache hits across all models |Server-wide |Per second | -| |Total Cache Miss Count |Total number of response cache misses across all models |Server-wide |Per second | -| |Total Cache Eviction Count |Total number of response cache evictions across all models |Server-wide |Per second | -| |Total Cache Lookup Time |Cumulative time requests spend checking for a cached response across all models (microseconds) |Server-wide |Per second | -| |Total Cache Utilization |Total Response Cache utilization rate (0.0 - 1.0) |Server-wide |Per second | -| 
|Cache Hit Count |Number of response cache hits per model |Per model |Per request | -| |Cache Hit Lookup Time |Cumulative time requests spend retrieving a cached response per model on cache hits (microseconds) |Per model |Per request | -| |Cache Miss Count |Number of response cache misses per model |Per model |Per request | -| |Cache Miss Lookup Time |Cumulative time requests spend looking up a request hash on a cache miss (microseconds) |Per model |Per request | -| |Cache Miss Insertion Time |Cumulative time requests spend inserting responses into the cache on a cache miss (microseconds) |Per model |Per request | - - -## Response Cache - -Compute latency metrics in the table above are calculated for the -time spent in model inference backends. If the response cache is enabled for a -given model (see [Response Cache](https://github.com/triton-inference-server/server/blob/main/docs/response_cache.md) -docs for more info), total inference times may be affected by response cache -lookup times. - -On cache hits, "Cache Hit Lookup Time" indicates the time spent looking up the -response, and "Compute Input Time" / "Compute Time" / "Compute Output Time" -are not recorded. - -On cache misses, "Cache Miss Lookup Time" indicates the time spent looking up -the request hash and "Cache Miss Insertion Time" indicates the time spent -inserting the computed output tensor data into the cache. Otherwise, "Compute -Input Time" / "Compute Time" / "Compute Output Time" will be recorded as usual. - -## Count Metrics - -For models that do not support batching, *Request Count*, *Inference -Count* and *Execution Count* will be equal, indicating that each -inference request is executed separately. - -For models that support batching, the count metrics can be interpreted -to determine average batch size as *Inference Count* / *Execution -Count*. The count metrics are illustrated by the following examples: - -* Client sends a single batch-1 inference request. *Request Count* = - 1, *Inference Count* = 1, *Execution Count* = 1. - -* Client sends a single batch-8 inference request. *Request Count* = - 1, *Inference Count* = 8, *Execution Count* = 1. - -* Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is not - enabled for the model. *Request Count* = 2, *Inference Count* = 9, - *Execution Count* = 2. - -* Client sends 2 requests: batch-1 and batch-1. Dynamic batcher is - enabled for the model and the 2 requests are dynamically batched by - the server. *Request Count* = 2, *Inference Count* = 2, *Execution - Count* = 1. - -* Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is - enabled for the model and the 2 requests are dynamically batched by - the server. *Request Count* = 2, *Inference Count* = 9, *Execution - Count* = 1. - -## Custom Metrics - -Triton exposes a C API to allow users and backends to register and collect -custom metrics with the existing Triton metrics endpoint. The user takes the -ownership of the custom metrics created through the APIs and must manage their -lifetime following the API documentation. - -The -[identity_backend](https://github.com/triton-inference-server/identity_backend/blob/main/README.md#custom-metric-example) -demonstrates a practical example of adding a custom metric to a backend. - -Further documentation can be found in the `TRITONSERVER_MetricFamily*` and -`TRITONSERVER_Metric*` API annotations in -[tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). 
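Because custom metrics are reported through the same endpoint as Triton's
built-in metrics, they can be checked with the same scrape shown earlier. The
metric family name below is purely hypothetical and stands in for whatever name
was registered through the C API:

```
# Scrape the metrics endpoint and filter for a (hypothetical) custom metric family
curl -s localhost:8002/metrics | grep my_custom_counter
```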
diff --git a/docs/perf_analyzer.md b/docs/perf_analyzer.md deleted file mode 100644 index 5decc4c5e4..0000000000 --- a/docs/perf_analyzer.md +++ /dev/null @@ -1,667 +0,0 @@ - - -# Performance Analyzer - -A critical part of optimizing the inference performance of your model -is being able to measure changes in performance as you experiment with -different optimization strategies. The perf_analyzer application -(previously known as perf_client) performs this task for the Triton -Inference Server. The perf_analyzer is included with the client -examples which are [available from several -sources](https://github.com/triton-inference-server/client#getting-the-client-libraries-and-examples). - -The perf_analyzer application generates inference requests to your -model and measures the throughput and latency of those requests. To -get representative results, perf_analyzer measures the throughput and -latency over a time window, and then repeats the measurements until it -gets stable values. By default perf_analyzer uses average latency to -determine stability but you can use the --percentile flag to stabilize -results based on that confidence level. For example, if ---percentile=95 is used the results will be stabilized using the 95-th -percentile request latency. For example, - -``` -$ perf_analyzer -m inception_graphdef --percentile=95 -*** Measurement Settings *** - Batch size: 1 - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using p95 latency - -Request concurrency: 1 - Client: - Request count: 348 - Throughput: 69.6 infer/sec - p50 latency: 13936 usec - p90 latency: 18682 usec - p95 latency: 19673 usec - p99 latency: 21859 usec - Avg HTTP time: 14017 usec (send/recv 200 usec + response wait 13817 usec) - Server: - Inference count: 428 - Execution count: 428 - Successful request count: 428 - Avg request latency: 12005 usec (overhead 36 usec + queue 42 usec + compute input 164 usec + compute infer 11748 usec + compute output 15 usec) - -Inferences/Second vs. Client p95 Batch Latency -Concurrency: 1, throughput: 69.6 infer/sec, latency 19673 usec -``` - -## Request Concurrency - -By default perf_analyzer measures your model's latency and throughput -using the lowest possible load on the model. To do this perf_analyzer -sends one inference request to Triton and waits for the response. -When that response is received, the perf_analyzer immediately sends -another request, and then repeats this process during the measurement -windows. The number of outstanding inference requests is referred to -as the *request concurrency*, and so by default perf_analyzer uses a -request concurrency of 1. - -Using the --concurrency-range \:\:\ option you can have -perf_analyzer collect data for a range of request concurrency -levels. Use the --help option to see complete documentation for this -and other options. For example, to see the latency and throughput of -your model for request concurrency values from 1 to 4: - -``` -$ perf_analyzer -m inception_graphdef --concurrency-range 1:4 -*** Measurement Settings *** - Batch size: 1 - Measurement window: 5000 msec - Latency limit: 0 msec - Concurrency limit: 4 concurrent requests - Using synchronous calls for inference - Stabilizing using average latency - -Request concurrency: 1 - Client: - Request count: 339 - Throughput: 67.8 infer/sec - Avg latency: 14710 usec (standard deviation 2539 usec) - p50 latency: 13665 usec -... 
-Request concurrency: 4 - Client: - Request count: 415 - Throughput: 83 infer/sec - Avg latency: 48064 usec (standard deviation 6412 usec) - p50 latency: 47975 usec - p90 latency: 56670 usec - p95 latency: 59118 usec - p99 latency: 63609 usec - Avg HTTP time: 48166 usec (send/recv 264 usec + response wait 47902 usec) - Server: - Inference count: 498 - Execution count: 498 - Successful request count: 498 - Avg request latency: 45602 usec (overhead 39 usec + queue 33577 usec + compute input 217 usec + compute infer 11753 usec + compute output 16 usec) - -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 67.8 infer/sec, latency 14710 usec -Concurrency: 2, throughput: 89.8 infer/sec, latency 22280 usec -Concurrency: 3, throughput: 80.4 infer/sec, latency 37283 usec -Concurrency: 4, throughput: 83 infer/sec, latency 48064 usec -``` - -## Understanding The Output - -For each request concurrency level perf_analyzer reports latency and -throughput as seen from the *client* (that is, as seen by -perf_analyzer) and also the average request latency on the server. - -The server latency measures the total time from when the request is -received at the server until the response is sent from the -server. Because of the HTTP and GRPC libraries used to implement the -server endpoints, total server latency is typically more accurate for -HTTP requests as it measures time from first byte received until last -byte sent. For both HTTP and GRPC the total server latency is -broken-down into the following components: - -- *queue*: The average time spent in the inference schedule queue by a - request waiting for an instance of the model to become available. -- *compute*: The average time spent performing the actual inference, - including any time needed to copy data to/from the GPU. - -The client latency time is broken-down further for HTTP and GRPC as -follows: - -- HTTP: *send/recv* indicates the time on the client spent sending the - request and receiving the response. *response wait* indicates time - waiting for the response from the server. -- GRPC: *(un)marshal request/response* indicates the time spent - marshalling the request data into the GRPC protobuf and - unmarshalling the response data from the GRPC protobuf. *response - wait* indicates time writing the GRPC request to the network, - waiting for the response, and reading the GRPC response from the - network. - -Use the verbose (-v) option to perf_analyzer to see more output, -including the stabilization passes run for each request concurrency -level. - -## Visualizing Latency vs. Throughput - -The perf_analyzer provides the -f option to generate a file containing -CSV output of the results. - -``` -$ perf_analyzer -m inception_graphdef --concurrency-range 1:4 -f perf.csv -$ cat perf.csv -Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency -1,69.2,225,2148,64,206,11781,19,0,13891,18795,19753,21018 -3,84.2,237,1768,21673,209,11742,17,0,35398,43984,47085,51701 -4,84.2,279,1604,33669,233,11731,18,1,47045,56545,59225,64886 -2,87.2,235,1973,9151,190,11346,17,0,21874,28557,29768,34766 -``` - -NOTE: The rows in the CSV file are sorted in an increasing order of throughput (Inferences/Second). - -You can import the CSV file into a spreadsheet to help visualize -the latency vs inferences/second tradeoff as well as see some -components of the latency. 
Follow these steps: - -- Open [this - spreadsheet](https://docs.google.com/spreadsheets/d/1S8h0bWBBElHUoLd2SOvQPzZzRiQ55xjyqodm_9ireiw) -- Make a copy from the File menu "Make a copy..." -- Open the copy -- Select the A1 cell on the "Raw Data" tab -- From the File menu select "Import..." -- Select "Upload" and upload the file -- Select "Replace data at selected cell" and then select the "Import data" button - -## Input Data - -Use the --help option to see complete documentation for all input -data options. By default perf_analyzer sends random data to all the -inputs of your model. You can select a different input data mode with -the --input-data option: - -- *random*: (default) Send random data for each input. -- *zero*: Send zeros for each input. -- directory path: A path to a directory containing a binary file for each input, named the same as the input. Each binary file must contain the data required for that input for a batch-1 request. Each file should contain the raw binary representation of the input in row-major order. -- file path: A path to a JSON file containing data to be used with every inference request. See the "Real Input Data" section for further details. --input-data can be provided multiple times with different file paths to specific multiple JSON files. - -For tensors with with STRING/BYTES datatype there are additional -options --string-length and --string-data that may be used in some -cases (see --help for full documentation). - -For models that support batching you can use the -b option to indicate -the batch-size of the requests that perf_analyzer should send. For -models with variable-sized inputs you must provide the --shape -argument so that perf_analyzer knows what shape tensors to use. For -example, for a model that has an input called *IMAGE* that has shape [ -3, N, M ], where N and M are variable-size dimensions, to tell -perf_analyzer to send batch-size 4 requests of shape [ 3, 224, 224 ]: - -``` -$ perf_analyzer -m mymodel -b 4 --shape IMAGE:3,224,224 -``` - -## Real Input Data - -The performance of some models is highly dependent on the data used. -For such cases you can provide data to be used with every inference -request made by analyzer in a JSON file. The perf_analyzer will use -the provided data in a round-robin order when sending inference -requests. - -Each entry in the "data" array must specify all input tensors with the -exact size expected by the model from a single batch. The following -example describes data for a model with inputs named, INPUT0 and -INPUT1, shape [4, 4] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - ... - ] - } -``` - -Note that the [4, 4] tensor has been flattened in a row-major format -for the inputs. In addition to specifying explicit tensors, you can -also provide Base64 encoded binary data for the tensors. Each data -object must list its data in a row-major order. Binary data must be in -little-endian byte order. 
The following example highlights how this -can be acheived: - -``` - { - "data" : - [ - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - ... - ] - } -``` - -In case of sequence models, multiple data streams can be specified in -the JSON file. Each sequence will get a data stream of its own and the -analyzer will ensure the data from each stream is played back to the -same correlation id. The below example highlights how to specify data -for multiple streams for a sequence model with a single input named -INPUT, shape [1] and data type STRING: - -``` - { - "data" : - [ - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["2"] - }, - { - "INPUT" : ["3"] - }, - { - "INPUT" : ["4"] - } - ], - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - } - ], - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - } - ] - ] - } -``` - -The above example describes three data streams with lengths 4, 3 and 2 -respectively. The perf_analyzer will hence produce sequences of -length 4, 3 and 2 in this case. - -You can also provide an optional "shape" field to the tensors. This is -especially useful while profiling the models with variable-sized -tensors as input. Additionally note that when providing the "shape" field, -tensor contents must be provided separately in "content" field in row-major -order. The specified shape values will override default input shapes -provided as a command line option (see --shape) for variable-sized inputs. -In the absence of "shape" field, the provided defaults will be used. There -is no need to specify shape as a command line option if all the data steps -provide shape values for variable tensors. Below is an example json file -for a model with single input "INPUT", shape [-1,-1] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [2,8] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [8,2] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [4,4] - } - } - ... - ] - } -``` - -The following is the example to provide contents as base64 string with explicit shapes: - -``` -{ - "data": [{ - "INPUT": { - "content": {"b64": "/9j/4AAQSkZ(...)"}, - "shape": [7964] - }}, - (...)] -} -``` - -### Output Validation - -When real input data is provided, it is optional to request perf analyzer to -validate the inference output for the input data. - -Validation output can be specified in "validation_data" field in the same format -as "data" field for real input. Note that the entries in "validation_data" must -align with "data" for proper mapping. The following example describes validation -data for a model with inputs named, INPUT0 and INPUT1, outputs named, OUTPUT0 -and OUTPUT1, all tensors have shape [4, 4] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - ... 
- ], - "validation_data" : - [ - { - "OUTPUT0" : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - "OUTPUT1" : [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] - } - ... - ] - } -``` - -Besides the above example, the validation outputs can be specified in the same -variations described in "real input data" section. - -## Shared Memory - -By default perf_analyzer sends input tensor data and receives output -tensor data over the network. You can instead instruct perf_analyzer to -use system shared memory or CUDA shared memory to communicate tensor -data. By using these options you can model the performance that you -can achieve by using shared memory in your application. Use ---shared-memory=system to use system (CPU) shared memory or ---shared-memory=cuda to use CUDA shared memory. - -## Communication Protocol - -By default perf_analyzer uses HTTP to communicate with Triton. The GRPC -protocol can be specificed with the -i option. If GRPC is selected the ---streaming option can also be specified for GRPC streaming. - -### SSL/TLS Support - -perf_analyzer can be used to benchmark Triton service behind SSL/TLS-enabled endpoints. These options can help in establishing secure connection with the endpoint and profile the server. - -For gRPC, see the following options: - -* `--ssl-grpc-use-ssl` -* `--ssl-grpc-root-certifications-file` -* `--ssl-grpc-private-key-file` -* `--ssl-grpc-certificate-chain-file` - -More details here: https://grpc.github.io/grpc/cpp/structgrpc_1_1_ssl_credentials_options.html - -The [inference protocol gRPC SSL/TLS section](inference_protocols.md#ssltls) describes server-side options to configure SSL/TLS in Triton's gRPC endpoint. - -For HTTPS, the following options are exposed: - -* `--ssl-https-verify-peer` -* `--ssl-https-verify-host` -* `--ssl-https-ca-certificates-file` -* `--ssl-https-client-certificate-file` -* `--ssl-https-client-certificate-type` -* `--ssl-https-private-key-file` -* `--ssl-https-private-key-type` - -See `--help` for full documentation. - -Unlike gRPC, Triton's HTTP server endpoint can not be configured with SSL/TLS support. - -Note: Just providing these `--ssl-http-*` options to perf_analyzer does not ensure the SSL/TLS is used in communication. If SSL/TLS is not enabled on the service endpoint, these options have no effect. The intent of exposing these options to a user of perf_analyzer is to allow them to configure perf_analyzer to benchmark Triton service behind SSL/TLS-enabled endpoints. In other words, if Triton is running behind a HTTPS server proxy, then these options would allow perf_analyzer to profile Triton via exposed HTTPS proxy. - -## Benchmarking Triton directly via C API - -Besides using HTTP or gRPC server endpoints to communicate with Triton, perf_analyzer also allows user to benchmark Triton directly using C API. HTTP/gRPC endpoints introduce an additional latency in the pipeline which may not be of interest to the user who is using Triton via C API within their application. Specifically, this feature is useful to benchmark bare minimum Triton without additional overheads from HTTP/gRPC communication. - -### Prerequisite -Pull the Triton SDK and the Inference Server container images on target machine. -Since you will need access to the Tritonserver install, it might be easier if -you copy the perf_analyzer binary to the Inference Server container. - -### Required Parameters -Use the --help option to see complete list of supported command line arguments. 
-By default perf_analyzer expects the Triton instance to already be running. You can configure the C API mode using the `--service-kind` option. In additon, you will need to point -perf_analyzer to the Triton server library path using the `--triton-server-directory` option and the model -repository path using the `--model-repository` option. -If the server is run successfully, there is a prompt: "server is alive!" and perf_analyzer will print the stats, as normal. -An example run would look like: -``` -perf_analyzer -m graphdef_int32_int32_int32 --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/workspace/qa/L0_perf_analyzer_capi/models -``` - -### Non-supported functionalities -There are a few functionalities that are missing from the C API. They are: -1. Async mode (`-a`) -2. Using shared memory mode (`--shared-memory=cuda` or `--shared-memory=system`) -3. Request rate range mode -4. For additonal known non-working cases, please refer to - [qa/L0_perf_analyzer_capi/test.sh](https://github.com/triton-inference-server/server/blob/main/qa/L0_perf_analyzer_capi/test.sh#L239-L277) - - -## Benchmarking TensorFlow Serving -perf_analyzer can also be used to benchmark models deployed on -[TensorFlow Serving](https://github.com/tensorflow/serving) using -the `--service-kind` option. The support is however only available -through gRPC protocol. - -Following invocation demonstrates how to configure perf_analyzer -to issue requests to a running instance of -`tensorflow_model_server`: - -``` -$ perf_analyzer -m resnet50 --service-kind tfserving -i grpc -b 1 -p 5000 -u localhost:8500 -*** Measurement Settings *** - Batch size: 1 - Using "time_windows" mode for stabilization - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using average latency -Request concurrency: 1 - Client: - Request count: 829 - Throughput: 165.8 infer/sec - Avg latency: 6032 usec (standard deviation 569 usec) - p50 latency: 5863 usec - p90 latency: 6655 usec - p95 latency: 6974 usec - p99 latency: 8093 usec - Avg gRPC time: 5984 usec ((un)marshal request/response 257 usec + response wait 5727 usec) -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 165.8 infer/sec, latency 6032 usec -``` - -You might have to specify a different url(`-u`) to access wherever -the server is running. The report of perf_analyzer will only -include statistics measured at the client-side. - -**NOTE:** The support is still in **beta**. perf_analyzer does -not guarantee optimum tuning for TensorFlow Serving. However, a -single benchmarking tool that can be used to stress the inference -servers in an identical manner is important for performance -analysis. - - -The following points are important for interpreting the results: -1. `Concurrent Request Execution`: -TensorFlow Serving (TFS), as of version 2.8.0, by default creates -threads for each request that individually submits requests to -TensorFlow Session. There is a resource limit on the number of -concurrent threads serving requests. When benchmarking at a higher -request concurrency, you can see higher throughput because of this. -Unlike TFS, by default Triton is configured with only a single -[instance count](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups). Hence, at a higher request concurrency, most -of the requests are blocked on the instance availability. 
To -configure Triton to behave like TFS, set the instance count to a -reasonably high value and then set -[MAX_SESSION_SHARE_COUNT](https://github.com/triton-inference-server/tensorflow_backend#parameters) -parameter in the model confib.pbtxt to the same value.For some -context, the TFS sets its thread constraint to four times the -num of schedulable CPUs. -2. `Different library versions`: -The version of TensorFlow might differ between Triton and -TensorFlow Serving being benchmarked. Even the versions of cuda -libraries might differ between the two solutions. The performance -of models can be susceptible to the versions of these libraries. -For a single request concurrency, if the compute_infer time -reported by perf_analyzer when benchmarking Triton is as large as -the latency reported by perf_analyzer when benchmarking TFS, then -the performance difference is likely because of the difference in -the software stack and outside the scope of Triton. -3. `CPU Optimization`: -TFS has separate builds for CPU and GPU targets. They have -target-specific optimization. Unlike TFS, Triton has a single build -which is optimized for execution on GPUs. When collecting performance -on CPU models on Triton, try running Triton with the environment -variable `TF_ENABLE_ONEDNN_OPTS=1`. - - -## Benchmarking TorchServe -perf_analyzer can also be used to benchmark -[TorchServe](https://github.com/pytorch/serve) using the -`--service-kind` option. The support is however only available through -HTTP protocol. It also requires input to be provided via JSON file. - -Following invocation demonstrates how to configure perf_analyzer to -issue requests to a running instance of `torchserve` assuming the -location holds `kitten_small.jpg`: - -``` -$ perf_analyzer -m resnet50 --service-kind torchserve -i http -u localhost:8080 -b 1 -p 5000 --input-data data.json - Successfully read data for 1 stream/streams with 1 step/steps. -*** Measurement Settings *** - Batch size: 1 - Using "time_windows" mode for stabilization - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using average latency -Request concurrency: 1 - Client: - Request count: 799 - Throughput: 159.8 infer/sec - Avg latency: 6259 usec (standard deviation 397 usec) - p50 latency: 6305 usec - p90 latency: 6448 usec - p95 latency: 6494 usec - p99 latency: 7158 usec - Avg HTTP time: 6272 usec (send/recv 77 usec + response wait 6195 usec) -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 159.8 infer/sec, latency 6259 usec -``` - -The content of `data.json`: - -``` - { - "data" : - [ - { - "TORCHSERVE_INPUT" : ["kitten_small.jpg"] - } - ] - } -``` - -You might have to specify a different url(`-u`) to access wherever -the server is running. The report of perf_analyzer will only include -statistics measured at the client-side. - -**NOTE:** The support is still in **beta**. perf_analyzer does not -guarantee optimum tuning for TorchServe. However, a single benchmarking -tool that can be used to stress the inference servers in an identical -manner is important for performance analysis. - -## Advantages of using Perf Analyzer over third-party benchmark suites - -Triton Inference Server offers the entire serving solution which -includes [client libraries](https://github.com/triton-inference-server/client) -that are optimized for Triton. -Using third-party benchmark suites like jmeter fails to take advantage of the -optimized libraries. Some of these optimizations includes but are not limited -to: -1. 
Using [binary tensor data extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) with HTTP requests. -2. Effective re-use of gRPC message allocation in subsequent requests. -3. Avoiding extra memory copy via libcurl interface. - -These optimizations can have a tremendous impact on overall performance. -Using perf_analyzer for benchmarking directly allows a user to access -these optimizations in their study. - -Not only that, perf_analyzer is also very customizable and supports many -Triton features as described in this document. This, along with a detailed -report, allows a user to identify performance bottlenecks and experiment -with different features before deciding upon what works best for them. diff --git a/docs/protocol/README.md b/docs/protocol/README.md index 3ce381c8c8..ddec7fc1d3 100644 --- a/docs/protocol/README.md +++ b/docs/protocol/README.md @@ -1,5 +1,5 @@ + +# Generate Extension + +> [!NOTE] +> The Generate Extension is *provisional* and likely to change in future versions. + +This document describes Triton's generate extension. The generate +extension provides a simple text-oriented endpoint schema for interacting with +large language models (LLMs). The generate endpoint is specific to HTTP/REST +frontend. + +## HTTP/REST + +In all JSON schemas shown in this document, `$number`, `$string`, `$boolean`, +`$object` and `$array` refer to the fundamental JSON types. #optional +indicates an optional JSON field. + +Triton exposes the generate endpoint at the following URLs. The client may use +HTTP POST request to different URLs for different response behavior, the +endpoint will return the generate results on success or an error in the case of +failure. + +``` +POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate + +POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate_stream +``` + +### generate vs. generate_stream + +Both URLs expect the same request JSON object, and generate the same JSON +response object. However, there are some differences in the format used to +return each: +* `/generate` returns exactly 1 response JSON object with a +`Content-Type` of `application/json` +* `/generate_stream` may return multiple responses based on the inference +results, with a `Content-Type` of `text/event-stream; charset=utf-8`. +These responses will be sent as +[Server-Sent Events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events) +(SSE), where each response will be a "data" chunk in the HTTP +response body. In the case of inference errors, responses will have +an [error JSON object](#generate-response-json-error-object). + * Note that the HTTP response code is set in the first response of the SSE, + so if the first response succeeds but an error occurs in a subsequent + response for the request, it can result in receiving an error object + while the status code shows success (200). Therefore, the user must + always check whether an error object is received when generating + responses through `/generate_stream`. + * If the request fails before inference begins, then a JSON error will + be returned with `Content-Type` of `application/json`, similar to errors + from other endpoints with the status code set to an error. + +### Generate Request JSON Object + +The generate request object, identified as *$generate_request*, is +required in the HTTP body of the POST request. The model name and +(optionally) version must be available in the URL. 
If a version is not +provided, the server may choose a version based on its own policies or +return an error. + + $generate_request = + { + "text_input" : $string, + "parameters" : $parameters #optional + } + +* "text_input" : The text input that the model should generate output from. +* "parameters" : An optional object containing zero or more parameters for this + generate request expressed as key/value pairs. See + [Parameters](#parameters) for more information. + +> [!NOTE] +> Any additional properties in the request object are passed either as +> parameters or tensors based on model specification. + +#### Parameters + +The `$parameters` JSON describes zero or more “name”/”value” pairs, +where the “name” is the name of the parameter and the “value” is a +`$string`, `$number`, or `$boolean`. + + $parameters = + { + $parameter, ... + } + + $parameter = $string : $string | $number | $boolean + +Parameters are model-specific. The user should check with the model +specification to set the parameters. + +#### Example Request + +Below is an example to send generate request with additional model parameters `stream` and `temperature`. + +``` +$ curl -X POST localhost:8000/v2/models/mymodel/generate -d '{"text_input": "client input", "parameters": {"stream": false, "temperature": 0}}' + +POST /v2/models/mymodel/generate HTTP/1.1 +Host: localhost:8000 +Content-Type: application/json +Content-Length: +{ + "text_input": "client input", + "parameters" : + { + "stream": false, + "temperature": 0 + } +} +``` + +### Generate Response JSON Object + +A successful generate request is indicated by a 200 HTTP status code. +The generate response object, identified as `$generate_response`, is returned in +the HTTP body. + + $generate_response = + { + "model_name" : $string, + "model_version" : $string, + "text_output" : $string + } + +* "model_name" : The name of the model used for inference. +* "model_version" : The specific model version used for inference. +* "text_output" : The output of the inference. + +#### Example Response + +``` +200 +{ + "model_name" : "mymodel", + "model_version" : "1", + "text_output" : "model output" +} +``` + +### Generate Response JSON Error Object + +A failed generate request must be indicated by an HTTP error status +(typically 400). The HTTP body must contain the +`$generate_error_response` object. + + $generate_error_response = + { + "error": + } + +* “error” : The descriptive message for the error. + +#### Example Error + +``` +400 +{ + "error" : "error message" +} +``` diff --git a/docs/protocol/extension_logging.md b/docs/protocol/extension_logging.md new file mode 100644 index 0000000000..e30c22b784 --- /dev/null +++ b/docs/protocol/extension_logging.md @@ -0,0 +1,198 @@ + + +# Logging Extension + +This document describes Triton's logging extension. The logging extension enables +the client to configure log settings during a Triton run. Triton reports "logging" +in the extensions field of its Server Metadata. + +## HTTP/REST + +In all JSON schemas shown in this document `$number`, `$string`, `$boolean`, +`$object` and `$array` refer to the fundamental JSON types. #optional +indicates an optional JSON field. + +Triton exposes the logging endpoint at the following URL. The client may use +HTTP GET request to retrieve the current log settings. A HTTP POST request +will modify the log settings, and the endpoint will return the updated log +settings on success or an error in the case of failure. 
+ +``` +GET v2/logging + +POST v2/logging +``` + +### Log Setting Response JSON Object + +A successful log setting request is indicated by a 200 HTTP status +code. The response object, identified as `$log_setting_response`, is +returned in the HTTP body for every successful log setting request. + +``` +$log_setting_response = +{ + $log_setting, ... +} + +$log_setting = $string : $string | $boolean | $number +``` + +Each `$log_setting` JSON describes a “name”/”value” pair, where the “name” is +the `$string` representation of the log setting and the “value” is a `$string`, +`$bool`, or `$number` representation of the setting value. Currently, the +following log settings are defined: + +- "log_file" : a `$string` parameter defining the file where the log outputs will be saved. If an empty string is specified, log outputs will stream to the console. + +- "log_info" : a `$boolean` parameter that controls whether the Triton server logs INFO level messages. + +- "log_warning" : a `$boolean` parameter that controls whether the Triton server logs WARNING level messages. + +- "log_error" : a `$boolean` parameter that controls whether the Triton server logs ERROR level messages. + +- "log_verbose_level" : a `$number` parameter that controls whether the Triton server outputs verbose messages +of varying degrees. This value can be any integer >= 0. If "log_verbose_level" is 0, verbose logging will be disabled, and +no verbose messages will be output by the Triton server. If "log_verbose_level" is 1, level 1 verbose messages will be output +by the Triton server. If "log_verbose_level" is 2, the Triton server will output all verbose messages of +level <= 2, etc. Attempting to set "log_verbose_level" to a number < 0 will result in an error. + +- "log_format" : a `$string` parameter that controls the format of Triton server log messages. There are currently +2 formats: "default" and "ISO8601". + + +### Log Setting Response JSON Error Object + +A failed log setting request will be indicated by an HTTP error status +(typically 400). The HTTP body will contain a `$log_setting_error_response` object. + +``` +$log_setting_error_response = +{ + "error": $string +} +``` + +- “error” : The descriptive message for the error. + +### Log Setting Request JSON Object + +A log setting request is made with a HTTP POST to +the logging endpoint. In the corresponding response, the HTTP body contains the +response JSON. A successful request is indicated by a 200 HTTP status code. + +The request object, identified as `$log_setting_request` must be provided in the HTTP +body. + +``` +$log_setting_request = +{ + $log_setting, ... +} +``` + +When a `$log_setting` JSON is received (defined above), only the specified +settings will be updated. + +### Example Usage +The logging protocol extension can be invoked using the curl library in the following manner (assuming +a Triton server is running at `localhost:8000`): +``` +curl -s -w '\n%{http_code}\n' -d '{"log_verbose_level":1}' -X POST localhost:8000/v2/logging +``` +This command should return a `$log_setting_response` JSON object with the following format: +``` +{"log_file":"","log_info":true,"log_warnings":true,"log_errors":true,"log_verbose_level":1,"log_format":"default"} +200 +``` +Note that the current values for all parameter fields are returned even though `log_verbose_level` +was the only parameter that was modified. 
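The current settings can also be read back without modifying anything by issuing
a GET request to the same endpoint (again assuming a Triton server running at
`localhost:8000`). The response is the same `$log_setting_response` object shown
above:

```
curl -s -w '\n%{http_code}\n' -X GET localhost:8000/v2/logging
```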
+ +## GRPC + +For the logging extension, Triton implements the following API: + +``` +service GRPCInferenceService +{ + … + + // Update and get the log setting of the Triton server. + rpc LogSettings(LogSettingsRequest) + returns (LogSettingsResponse) {} +} +``` + +The Log Setting API returns the latest log settings. Errors are indicated +by the `google.rpc.Status` returned for the request. The OK code +indicates success and other codes indicate failure. The request and +response messages for Log Settings are: + +``` +message LogSettingsRequest +{ + message SettingValue + { + oneof parameter_choice + { + // bool param option + bool bool_param = 1; + + // uint32 param option + uint32 uint32_param = 2; + + // string param option + string string_param = 3; + } + } + // The new setting values to be updated. + // Unspecified settings will remain unchanged. + map settings = 1; +} + +message LogSettingsResponse +{ + message SettingValue + { + oneof parameter_choice + { + // bool param option + bool bool_param = 1; + + // uint32 param option + uint32 uint32_param = 2; + + // string param option + string string_param = 3; + } + } + // The latest log settings values. + map settings = 1; +} +``` diff --git a/docs/protocol/extension_model_configuration.md b/docs/protocol/extension_model_configuration.md index 6e995cf77c..04a2d28fac 100644 --- a/docs/protocol/extension_model_configuration.md +++ b/docs/protocol/extension_model_configuration.md @@ -1,5 +1,5 @@ + +# Parameters Extension + +This document describes Triton's parameters extension. The +parameters extension allows an inference request to provide +custom parameters that cannot be provided as inputs. Because this extension is +supported, Triton reports “parameters” in the extensions field of its +Server Metadata. This extension uses the optional "parameters" +field in the KServe Protocol in +[HTTP](https://kserve.github.io/website/0.10/modelserving/data_plane/v2_protocol/#inference-request-json-object) +and +[GRPC](https://kserve.github.io/website/0.10/modelserving/data_plane/v2_protocol/#parameters). + +The following parameters are reserved for Triton's usage and should not be +used as custom parameters: + +- sequence_id +- priority +- timeout +- sequence_start +- sequence_end +- headers +- All the keys that start with `"triton_"` prefix. Some examples used today: + - `"triton_enable_empty_final_response"` request parameter + - `"triton_final_response"` response parameter + +When using both GRPC and HTTP endpoints, you need to make sure to not use +the reserved parameters list to avoid unexpected behavior. The reserved +parameters are not accessible in the Triton C-API. + +## HTTP/REST + +The following example shows how a request can include custom parameters. + +``` +POST /v2/models/mymodel/infer HTTP/1.1 +Host: localhost:8000 +Content-Type: application/json +Content-Length: +{ + "parameters" : { "my_custom_parameter" : 42 } + "inputs" : [ + { + "name" : "input0", + "shape" : [ 2, 2 ], + "datatype" : "UINT32", + "data" : [ 1, 2, 3, 4 ] + } + ], + "outputs" : [ + { + "name" : "output0", + } + ] +} +``` + +## GRPC + +The `parameters` field in the +ModelInferRequest message can be used to send custom parameters. + +## Forwarding HTTP/GRPC Headers as Parameters + +Triton can forward HTTP/GRPC headers as inference request parameters. By +specifying a regular expression in `--http-header-forward-pattern` and +`--grpc-header-forward-pattern`, +Triton will add the headers that match with the regular expression as request +parameters. 
All the forwarded headers will be added as a parameter with string +value. For example to forward all the headers that start with 'PREFIX_' from +both HTTP and GRPC, you should add `--http-header-forward-pattern PREFIX_.* +--grpc-header-forward-pattern PREFIX_.*` to your `tritonserver` command. + +The forwarded headers can be accessed using the +[Python](https://github.com/triton-inference-server/python_backend#inference-request-parameters) +or C Backend APIs as inference request parameters. + diff --git a/docs/protocol/extension_schedule_policy.md b/docs/protocol/extension_schedule_policy.md index a49a97a3de..c3c57a63c7 100644 --- a/docs/protocol/extension_schedule_policy.md +++ b/docs/protocol/extension_schedule_policy.md @@ -1,5 +1,5 @@ - -# Triton Response Cache (beta) - -**This feature is currently in beta and may be subject to change.** - -In this document an *inference request* is the model name, model version, and -input tensors (name, shape, datatype and tensor data) that make up a request -submitted to Triton. An inference result is the output tensors (name, shape, -datatype and tensor data) produced by an inference execution. The response cache -is used by Triton to hold inference results generated for previous executed -inference requests. Triton will maintain the response cache so that inference -requests that hit in the cache will not need to execute a model to produce -results and will instead extract their results from the cache. For some use -cases this can significantly reduce the inference request latency. - -The response cache is enabled by setting a non-zero size when Triton is launched -using the `--response-cache-byte-size` flag. The flag defaults to 0 (zero). When -non-zero, Triton allocates the requested size in CPU memory and **shares the -cache across all inference requests and across all models**. For a given model -to use response caching, the model must enable response caching in the model -configuration. **By default, no model uses response caching even if the response -cache is enabled with the `--response-cache-byte-size` flag.** For more -information on enabling the response cache for each model, see the [model -configuration -docs](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#response-cache). - -Triton accesses the response cache with a hash of the inference request that -includes the model name, model version and model inputs. If the hash is found in -the cache, the corresponding inference result is extracted from the cache and -used for the request. When this happens there is no need for Triton to execute -the model to produce the inference result. If the hash is not found in the -cache, Triton executes the model to produce the inference result, and then -records that result in the cache so that subsequent inference requests can -(re)use those results. - -The response cache is a fixed-size resource, as a result it must be managed by a -replacement policy when the number of cacheable responses exceeds the capacity -of the cache. Currently, the cache only implements a least-recently-used -([LRU](https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU))) -replacement policy which will automatically evict one or more LRU entries to -make room for new entries. - -## Known Limitations - -- Only input tensors located in CPU memory will be hashable for accessing the - cache. 
If an inference request contains input tensors not in CPU memory, the - request will not be hashed and therefore the response will not be cached. -- Only responses with all output tensors located in CPU memory will be eligible - for caching. If any output tensor in a response is not located in CPU memory, - the response will not be cached. -- The cache is accessed using only the inference request hash. As a result, if - two different inference requests generate the same hash (a hash collision), - then Triton may incorrectly use the cached result for an inference request. - The hash is a 64-bit value so the likelihood of collision is small. -- Only successful inference requests will have their responses cached. If a - request fails or returns an error during inference, its response will not be - cached. -- Only requests going through the Default Scheduler or Dynamic Batch Scheduler - are eligible for caching. The Sequence Batcher does not currently support - response caching. diff --git a/docs/trace.md b/docs/trace.md deleted file mode 100644 index 4725925dee..0000000000 --- a/docs/trace.md +++ /dev/null @@ -1,305 +0,0 @@ - - -# Triton Server Trace - -Triton includes that capability to generate a detailed trace for -individual inference requests. Tracing is enable by command-line -arguments when running the tritonserver executable. For example, - -``` -$ tritonserver --trace-file=/tmp/trace.json --trace-rate=100 --trace-level=TIMESTAMPS ... -``` - -The --trace-file option indicates where the trace output should be -written. The --trace-rate option specifies the sampling rate. In -this example every 100-th inference request will be traced. The ---trace-level option indicates the level of trace detail that should -be collected. --trace-level option may be specified multiple times to -trace multiple informations. Use the --help option to get more information. - -In addition to configure trace settings in command line arguments, The user may -modify the trace setting when Triton server -is running via the trace APIs, more information can be found in [trace -protocol](protocol/extension_trace.md). - -## Supported Trace Level Option - -- `TIMESTAMPS`: Tracing execution timestamps of each request. -- `TENSORS`: Tracing input and output tensors during the execution. - -## JSON Trace Output - -The trace output is a JSON file with the following schema. - -``` -[ - { - "model_name": $string, - "model_version": $number, - "id": $number - "parent_id": $number, - "timestamps": [ - { "name" : $string, "ns" : $number }, - ... - ] - }, - { - "model_name": $string, - "model_version": $number, - "id": $number - "activity": $string, - "tensor":{ - "name": $string, - "data": $string, - "dtype": $string - } - }, - ... -] -``` - -Each trace is assigned a "id", which indicates the model name and -version of the inference request. If the trace is from a -model run as part of an ensemble, the "parent_id" will indicate the -"id" of the containing ensemble. - -Each `TIMESTAMPS` trace will have one or more "timestamps" with -each timestamp having a name and the timestamp in nanoseconds ("ns"). 
-For example: - -``` -[ - { - "model_name": "simple", - "model_version": -1, - "id": 1, - "timestamps" : [ - { "name": "http recv start", "ns": 2259961222771924 }, - { "name": "http recv end", "ns": 2259961222820985 }, - { "name": "request handler start", "ns": 2259961223164078 }, - { "name": "queue start", "ns": 2259961223182400 }, - { "name": "compute start", "ns": 2259961223232405 }, - { "name": "compute end", "ns": 2259961230206777 }, - { "name": "request handler end", "ns": 2259961230211887 }, - { "name": "http send start", "ns": 2259961230529606 }, - { "name": "http send end", "ns": 2259961230543930 } - ] - } -] -``` - -Each `TENSORS` trace will contain an "activity" and a "tensor". -"activity" indicates the type of tensor, including "TENSOR_QUEUE_INPUT" -and "TENSOR_BACKEND_OUTPUT" by now. "tensor" has the detail of tensor, -including its "name", "data" and "dtype". For example: - -``` -[ - { - "model_name": "simple", - "model_version": -1, - "id": 1, - "activity": "TENSOR_QUEUE_INPUT", - "tensor":{ - "name": "input", - "data": "0.1,0.1,0.1,...", - "dtype": "FP32" - } - } -] -``` - -## Trace Summary Tool - -An example [trace summary tool](../qa/common/trace_summary.py) can be -used to summarize a set of traces collected from Triton. Basic usage -is: - -``` -$ trace_summary.py -``` - -This produces a summary report for all traces in the file. HTTP and -GRPC inference requests are reported separately. - -``` -File: trace.json -Summary for simple (-1): trace count = 1 -HTTP infer request (avg): 378us - Receive (avg): 21us - Send (avg): 7us - Overhead (avg): 79us - Handler (avg): 269us - Overhead (avg): 11us - Queue (avg): 15us - Compute (avg): 242us - Input (avg): 18us - Infer (avg): 208us - Output (avg): 15us -Summary for simple (-1): trace count = 1 -GRPC infer request (avg): 21441us - Wait/Read (avg): 20923us - Send (avg): 74us - Overhead (avg): 46us - Handler (avg): 395us - Overhead (avg): 16us - Queue (avg): 47us - Compute (avg): 331us - Input (avg): 30us - Infer (avg): 286us - Output (avg): 14us -``` - -Use the -t option to get a summary for each trace in the file. This -summary shows the time, in microseconds, between different points in -the processing of an inference request. For example, the below output -shows that it took 15us from the start of handling the request until -the request was enqueued in the scheduling queue. - -``` -$ trace_summary.py -t -... -simple (-1): - grpc wait/read start - 26529us - grpc wait/read end - 39us - request handler start - 15us - queue start - 20us - compute start - 266us - compute end - 4us - request handler end - 19us - grpc send start - 77us - grpc send end -... -``` - -The script can also show the data flow of the first request if there are -`TENSORS` traces in the file. If the `TENSORS` traces are from an ensemble, -the data flow will be shown with the dependency of each model. - -``` -... -Data Flow: - ========================================================== - Name: ensemble - Version:1 - QUEUE_INPUT: - input: [[0.705676 0.830855 0.833153]] - BACKEND_OUTPUT: - output: [[1. 2. 7. 0. 4. 7. 9. 3. 4. 9.]] - ========================================================== - ================================================== - Name: test_trt1 - Version:1 - QUEUE_INPUT: - input: [[0.705676 0.830855 0.833153]] - BACKEND_OUTPUT: - output1: [[1. 1. 
...]] - ================================================== - ================================================== - Name: test_trt2 - Version:1 - QUEUE_INPUT: - input: [[0.705676 0.830855 0.833153]] - BACKEND_OUTPUT: - output2: [[2. 2. ...]] - ================================================== - ================================================== - Name: test_py - Version:1 - QUEUE_INPUT: - output1: [[1. 1. ...]] - QUEUE_INPUT: - output2: [[2. 2. ...]] - BACKEND_OUTPUT: - output: [[1. 2. 7. 0. 4. 7. 9. 3. 4. 9.]] - ================================================== -... -``` - -The meaning of the trace timestamps is: - -* GRPC Request Wait/Read: Collected only for inference requests that use the - GRPC protocol. The time spent waiting for a request to arrive at the - server and for that request to be read. Because wait time is - included in the time it is not a useful measure of how much time is - spent reading a request from the network. Tracing an HTTP request - will provide an accurate measure of the read time. - -* HTTP Request Receive: Collected only for inference requests that use the - HTTP protocol. The time required to read the inference request from - the network. - -* Send: The time required to send the inference response. - -* Overhead: Additional time required in the HTTP or GRPC endpoint to - process the inference request and response. - -* Handler: The total time spent handling the inference request, not - including the HTTP and GRPC request/response handling. - - * Queue: The time the inference request spent in the scheduling queue. - - * Compute: The time the inference request spent executing the actual - inference. This time includes the time spent copying input and - output tensors. If --trace-level=TIMESTAMPS then a breakdown of the - compute time will be provided as follows: - - * Input: The time to copy input tensor data as required by the - inference framework / backend. This includes the time to copy - input tensor data to the GPU. - - * Infer: The time spent executing the model to perform the - inference. - - * Output: The time to copy output tensor data as required by the - inference framework / backend. This includes the time to copy - output tensor data from the GPU. - - * Overhead: Additional time required for request handling not - covered by Queue or Compute times. - -* Data Flow: The data flow of the first request. It contains the input and - output tensors of each part of execution. - - * Name: The name of model. - - * Version: The version of model. - - * QUEUE_INPUT: The tensor entering the queue of a backend to wait for - scheduling. - - * BACKEND_OUTPUT: The tensor in the response of a backend. diff --git a/docs/architecture.md b/docs/user_guide/architecture.md similarity index 96% rename from docs/architecture.md rename to docs/user_guide/architecture.md index c58d80f6b6..b343842014 100644 --- a/docs/architecture.md +++ b/docs/user_guide/architecture.md @@ -1,5 +1,5 @@ + +# Debugging Guide +This guide goes over first-step troubleshooting for common scenarios in which Triton is behaving unexpectedly or failing. Below, we break down the issues into these categories: + +- **[Configuration](#configuration-issues)**: Triton reports an error with your configuration file. +- **[Model](#model-issues)**: Your model fails to load or perform inference. +- Server: The server is crashing or unavailable. +- Client: The client is failing in sending and receiving data to the server. +- Performance: Triton is not achieving optimal performance. 
+ +Regardless of the category of your issue, it is worthwhile to try running in the latest Triton container, whenever possible. While we provide support to older containers, fixes get merged into the next release. By checking the latest release, you can spot whether this issue has already been resolved. + +You can also search [Triton’s GitHub issues](https://github.com/triton-inference-server/server/issues) to see if someone previously asked about your issue. If you received an error, you can use a few keywords from the error as a search term. + +Triton provides different types of errors and statuses, relevant across a wide swath of issues. Here is an overview of them: + +| Error | Definition | Example | +| ----- | ---------- | ------- | +|Already Exists | Returned when an action cannot be done because there is already an existing item. | A registered model fails to be registered again.| +| Internal | Returned when there is an unexpected failure within the Triton code. | A memory allocation fails. | +| Invalid Arg | Returned when an invalid argument is provided to a function | A model config has an invalid parameter | +| Not Found | Returned when a requested resource is unable to be found | A shared library is unable to be found | +| Unavailable | Returned when a requested resource is found but unavailable | A requested model is not ready for inference | +| Unknown | Returned for cases where the reason for the error is unknown | This error code should not be used | +| Unsupported | Returned when an option is unsupported | A model config includes a parameter that is not yet supported for that backend | + +## Configuration Issues + +Before proceeding, please see if the model configuration documentation [here](./model_configuration.md) resolves your question. Beyond that, the best places to find a sample model configuration for your use cases are: + +- The server [qa folder](https://github.com/triton-inference-server/server/tree/main/qa). You can find test scripts covering most features, including some which update the model config files to do so. + - [Custom_models](https://github.com/triton-inference-server/server/tree/main/qa/custom_models), [ensemble_models](https://github.com/triton-inference-server/server/tree/main/qa/ensemble_models), and [python_models](https://github.com/triton-inference-server/server/tree/main/qa/python_models) include examples of configs for their respective use cases. + - [L0_model_config](https://github.com/triton-inference-server/server/tree/main/qa/L0_model_config) tests many types of incomplete model configs. + +Note that if you are running into an issue with [perf_analyzer](https://github.com/triton-inference-server/server/blob/main/docs/perf_analyzer.md) or [Model Analyzer](https://github.com/triton-inference-server/model_analyzer), try loading the model onto Triton directly. This checks if the configuration is incorrect or the perf_analyzer or Model Analyzer options need to be updated. + +## Model Issues +**Step 1. Run Models Outside of Triton** + +If you are running into an issue with loading or running a model, the first step is to ensure your model runs in its framework outside of Triton. For example, you can run ONNX models in ONNX Runtime and TensorRT models in trtexec. If this check fails, the issue is happening within the framework and not within Triton. + +**Step 2. Find the Error Message** + +If you receive an error message, you may be able to find where it was generated by searching the code. 
GitHub provides instructions for searching code [here](https://docs.github.com/en/search-github/searching-on-github/searching-code). A generic search through the Triton organization is available at [this link](https://github.com/search?q=org%3Atriton-inference-server&type=Code). + +If your error message only occurs in one or a few places in the Triton code, you may be able to see what’s going wrong pretty quickly. Even if not, it’s good to save this link to provide to us when asking for help with your issue. This is often the first thing we look for. + +**Step 3. Build with Debug Flags** + +The next step is building with debug flags. We unfortunately don’t provide a debug container, so you’d need to follow the [build guide](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md) to build the container, which includes a [section on adding debug symbols](https://github.com/triton-inference-server/server/blob/main/docs/build.md#building-with-debug-symbols). Once you do so, you can install GDB (`apt-get install gdb`) in the container and run Triton in GDB (`gdb --args tritonserver…`). If needed, you can open a second terminal to run a script in another container. If the server segfaults, you can enter `backtrace`, which will provide you a call stack that lets you know where the error got generated. You should then be able to trace the source of the error. If the bug still exists after debugging, we’ll need this to expedite our work. + +Advanced GDB users can also examine variable values, add breakpoints, and more to find the cause of their issue. + +### Specific Issues +**Undefined Symbols** + +There are a few options here: +- This often means a version mismatch between the version of a framework used by Triton and the one used to create the model. Check the version of the framework used in the Triton container and compare against the version used to generate the model. +- If you are loading a shared library used by a backend, don’t forget to include LD_PRELOAD before the command to run Tritonserver.  + - `LD_PRELOAD= tritonserver --model-repository…` +If you built the backend yourself, this could be a linking error. If you are confident the backends and server were built correctly, double check that the server is loading the correct backend. + +## Server Issues + +You generally should not run into errors with the server itself. If the server goes down, it’s usually because something went wrong during model loading or inference and you can use the above section to debug. It’s particularly useful to work through the [Building with Debug Flags](https://github.com/triton-inference-server/server/blob/main/docs/build.md#building-with-debug-symbols) section above to resolve those sorts of issues. However, this section will go through some specific cases that may occur. + +### No Connection to Server + +If you are having trouble connecting to the server or getting its health via the health endpoint (`curl -v localhost:8000/v2/health/ready`), make sure you are able to reach the network your server is running on from where you are running your command. Most commonly, we see that when separate Docker containers are started for the client and server, they are not started with [--net=host](https://docs.docker.com/network/host/) to share the network. + +### Intermittent Failure + +This is going to be one of the hardest things to debug. If possible, you want to build your server with debug flags to get a backtrace of what is happening specifically. 
You would also want to keep notes to see how often this happens and whether that is a common cause. The server itself should not fail while idling, so see if a certain action (loading/unloading a model, running a model inference, etc.) is triggering it. + +### Server Failure Due to Individual Models + +If you want the server to start up even when models fail, use the `exit-on-error=false` option. If you want the server health endpoint to show ready even when specific models fail, use the `--strict-readiness=false` flag. + +### Deadlock + +Some useful steps for debugging a deadlock with `gdb`: +1. Use `$info threads` to see which threads are waiting. +2. Go to a thread: `$thread 4`. +3. Print the backtrace: `$bt`. +4. Go to the frame with the lock: `$f 1`. +5. Print the memory of the mutex being held: `$p *mutex`. +6. You can now see the owner of the mutex under `owner`. + +## Client Issues + +For working with different client cases, the best resources are the [client repo’s](https://github.com/triton-inference-server/client) examples. You can see clients written in Python, Java, and C++ with running examples across many common use cases. You can review the main functions of these clients to get a sense of the flow of the code. + +We often get performance optimization questions around the clients. Triton clients send input tensors as raw binary. However, GRPC uses protobuf which has some serialization and deserialization overhead. For those looking for the lowest-latency solution, C API eliminates the latency associated with GRPC/HTTP. Shared memory is also a good option to reduce data movement when the client and server are on the same system. + +## Performance Issues + +This section goes over debugging unexpected performance. If you are looking to optimize performance, please see the [Optimization](https://github.com/triton-inference-server/server/blob/main/docs/optimization.md) and [Performance Tuning](https://github.com/triton-inference-server/server/blob/main/docs/performance_tuning.md) guides. + +The easiest step to start with is running perf_analyzer to get a breakdown of the request lifecycle, throughput, and latency for each individual model. For a more detailed view, you can [enable tracing](https://github.com/triton-inference-server/server/blob/main/docs/trace.md) when running the server. This will provide exact timestamps to drill down into what is happening. You can also enable tracing with perf_analyzer for the GRPC and HTTP clients by using the tracing flags. Note that enabling tracing can impact Triton’s performance, but it can be helpful to examine the timestamps throughout a request’s lifecycle. + +### Performance Profiling + +The next step would be to use a performance profiler. One profiler we recommend is [Nsight Systems](https://developer.nvidia.com/nsight-systems) (nsys), optionally including NVIDIA Tools Extension (NVTX) markers to profile Triton. + +The Triton server container already has nsys installed. However, Triton does not build with the NVTX markers by default. If you want to use NVTX markers, you should build Triton with build.py, using the “--enable-nvtx” flag. This will provide details around some phases of processing a request, such as queueing, running inference, and handling outputs. + +You can profile Triton by running `nsys profile tritonserver --model-repository …`. The [nsys documentation](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) provides more options and details for getting a thorough overview of what is going on. 
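+
+If you are unsure which nsys options to start with, the following invocation is
+one reasonable sketch (the flag values are illustrative, and `/mnt/models` is a
+placeholder for your model repository path):
+
+```
+# Trace CUDA, NVTX, and OS runtime activity for 60 seconds and write the report to "triton_profile"
+nsys profile -t cuda,nvtx,osrt -d 60 -o triton_profile tritonserver --model-repository=/mnt/models
+```
+
+Remember that Triton's own NVTX ranges will only appear in the report if the server
+was built with the `--enable-nvtx` flag described above.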
+ +## Submitting an Issue + +If you’ve done the initial debugging steps with no results, the next step is to submit the issue to us. Before you do so, please answer these questions: +- Is this reproducible with multiple models and/or our example models? Or is the issue unique to your model? +- Is the bug reproducible with any protocol (ex: HTTP vs GRPC)? Or only one protocol? + +The answers to the above should inform what you submit. If you find that this issue only happens under specific circumstances, please include this in your report. If the issue still exists, please submit **all** of the below: + +- The commands or script used to build/pull Triton and run your models. + - If building Triton, please provide the version or branch you are building from. +- Your model configuration file. +- The error received, plus any logs. + - If your issue involves the server crashing, a backtrace of the dump would be helpful. + - Please enable verbose logging (--verbose-log=1) to get the most detailed logs. +- If this issue is unique to your model, your model or a toy model that reproduces the issue. +- Anything else that would expedite our investigation. diff --git a/docs/decoupled_models.md b/docs/user_guide/decoupled_models.md similarity index 73% rename from docs/decoupled_models.md rename to docs/user_guide/decoupled_models.md index 23ac7febe7..fbe6f4c298 100644 --- a/docs/decoupled_models.md +++ b/docs/user_guide/decoupled_models.md @@ -1,5 +1,5 @@ + +# Metrics + +Triton provides [Prometheus](https://prometheus.io/) metrics +indicating GPU and request statistics. By default, these metrics are +available at http://localhost:8002/metrics. The metrics are only +available by accessing the endpoint, and are not pushed or published +to any remote server. The metric format is plain text so you can view +them directly, for example: + +``` +$ curl localhost:8002/metrics +``` + +The `tritonserver --allow-metrics=false` option can be used to disable +all metric reporting, while the `--allow-gpu-metrics=false` and +`--allow-cpu-metrics=false` can be used to disable just the GPU and CPU +metrics respectively. + +The `--metrics-port` option can be used to select a different port. By default, +Triton reuses the `--http-address` option for the metrics endpoint and binds the +http and metrics endpoints to the same specific address when http service is +enabled. If http service is not enabled, the metric address will bind to `0.0.0.0` +by default. To uniquely specify the metric endpoint, `--metrics-address` option +can be used. See the `tritonserver --help` output for more info on these CLI options. + +To change the interval at which metrics are polled/updated, see the `--metrics-interval-ms` flag. Metrics that are updated "Per Request" are unaffected by this interval setting. This interval only applies to metrics that are designated as "Per Interval" in the tables of each section below: + +- [Inference Request Metrics](#inference-request-metrics) +- [GPU Metrics](#gpu-metrics) +- [CPU Metrics](#cpu-metrics) +- [Response Cache Metrics](#response-cache-metrics) +- [Custom Metrics](#custom-metrics) + +## Inference Request Metrics + +### Counts + +For models that do not support batching, *Request Count*, *Inference +Count* and *Execution Count* will be equal, indicating that each +inference request is executed separately. + +For models that support batching, the count metrics can be interpreted +to determine average batch size as *Inference Count* / *Execution +Count*. 
The count metrics are illustrated by the following examples: + +* Client sends a single batch-1 inference request. *Request Count* = + 1, *Inference Count* = 1, *Execution Count* = 1. + +* Client sends a single batch-8 inference request. *Request Count* = + 1, *Inference Count* = 8, *Execution Count* = 1. + +* Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is not + enabled for the model. *Request Count* = 2, *Inference Count* = 9, + *Execution Count* = 2. + +* Client sends 2 requests: batch-1 and batch-1. Dynamic batcher is + enabled for the model and the 2 requests are dynamically batched by + the server. *Request Count* = 2, *Inference Count* = 2, *Execution + Count* = 1. + +* Client sends 2 requests: batch-1 and batch-8. Dynamic batcher is + enabled for the model and the 2 requests are dynamically batched by + the server. *Request Count* = 2, *Inference Count* = 9, *Execution + Count* = 1. + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|Count |Success Count |`nv_inference_request_success` |Number of successful inference requests received by Triton (each request is counted as 1, even if the request contains a batch) |Per model |Per request | +| |Failure Count |`nv_inference_request_failure` |Number of failed inference requests received by Triton (each request is counted as 1, even if the request contains a batch) |Per model |Per request | +| |Inference Count |`nv_inference_count` |Number of inferences performed (a batch of "n" is counted as "n" inferences, does not include cached requests)|Per model|Per request| +| |Execution Count |`nv_inference_exec_count` |Number of inference batch executions (see [Inference Request Metrics](#inference-request-metrics), does not include cached requests)|Per model|Per request| +| |Pending Request Count |`nv_inference_pending_request_count` |Number of inference requests awaiting execution by a backend. This number is incremented when a request is enqueued to the server (`TRITONSERVER_ServerInferAsync`) and is decremented when a backend is about to start executing the request. More details can be found below. |Per model|Per request| + +#### Pending Request Count (Queue Size) Per-Model + +The *Pending Request Count* reflects the number of requests that have been +received by Triton core via `TRITONSERVER_InferAsync`, but have not yet +started execution by a backend model instance +(`TRITONBACKEND_ModelInstanceExecute`). + +For all intents and purposes, the +"pending request count" and "queue size" per-model can be used +interchangeably, and the number reflected in the metric should +intuitively represent the number of requests that are not currently +being executed by any model instances. In simple terms, if you send a 100 +requests to a model that can only handle 5 requests concurrently, then you +should see a pending count of 95 for that model in most cases. + +For those interested in more technical details, the term "pending request count" +is a bit more accurate than "queue size" because Triton is highly configurable, +and there are many places in Triton that a request be considered pending rather +than a single queue. Some of the most common will be called out below: +- Default Scheduler backlogs any requests not currently executing. + - Assuming 1 available model instance with the default scheduler settings, + and 10 requests are sent in rapid succession. 
+ - The 1st request should be picked up for + execution immediately, and the remaining 9 requests should be considered + pending for this model, until the 1st request is finished. Afterwards, the + next request should be picked up and the pending count should be decremented + to 8, and so on until all requests are finished and the pending count is 0. +- Dynamic Batcher queue for dynamically creating batches from requests. + - Assuming 1 available model instance with the dynamic batch scheduler + configured with `max_batch_size: 4` and a sufficiently large + `max_queue_delay_microseconds` (or queue of requests), + and 10 requests are sent in rapid succession. + - The first 4 requests, or as large of a batch the scheduler could form, + should be picked up for execution immediately, and the remaining 6 requests + should be considered pending. After the batch finishes, the next batch + should be picked up, decrementing the pending count again to 2 pending. + Then finally since only 2 requests remain, the final 2 requests will be + batched and picked up by the backend, decrementing the pending count to 0. +- Sequence Batcher queues and backlogs for ongoing sequence requests, some may + be assigned sequence slots, some may not. + - Sequence Batchers of both strategies (direct and oldest) will have pending + counts that generally follow the same trend as the dynamic batching + description above. The sequence batchers will immediately execute as many + requests in a batch as it can based on the model/scheduler config settings, + and any further requests will be considered pending until the previous batch + finishes and the next batch can start. +- Rate Limiter queues for prepared batches of requests. + - When rate limiting is enabled, requests can be held back from execution + to satisfy the rate limit constraints that were configured. + +There are some places where a request would not be considered pending: +- Ensemble Scheduler + - The Ensemble Scheduler almost immediately enqueues any requests it receives + into the composing model schedulers at the first step in the ensemble. + Therefore, the requests could be considered pending by the composing model + scheduler's, however from the ensemble's perspective, these requests have been + scheduled. +- Frontends (HTTP/GRPC Servers) + - Any requests sent from a client to a frontend server in-front of Triton + may spend some time in the corresponding server's code mapping + protocol-specific metadata to Triton metadata. Though this time is + generally brief, it will not be considered pending from Triton's + perspective until Triton core has received the request from the frontend. + +### Latencies + +Starting in 23.04, Triton exposes the ability to choose the types of metrics +that are published through the `--metrics-config` CLI options. 
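+
+For example, the latency metric families described in the subsections below can
+be toggled together at startup. This is only a sketch; it uses the
+`counter_latencies` and `summary_latencies` keys covered in this section, and
+the model repository path is a placeholder:
+
+```
+# Keep the default latency counters and additionally enable summary metrics
+tritonserver --model-repository=/models \
+             --metrics-config counter_latencies=true \
+             --metrics-config summary_latencies=true
+```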
+ +#### Counters + +By default, the following +[Counter](https://prometheus.io/docs/concepts/metric_types/#counter) +metrics are used for latencies: + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|Latency |Request Time |`nv_inference_request_duration_us` |Cumulative end-to-end inference request handling time (includes cached requests) |Per model |Per request | +| |Queue Time |`nv_inference_queue_duration_us` |Cumulative time requests spend waiting in the scheduling queue (includes cached requests) |Per model |Per request | +| |Compute Input Time|`nv_inference_compute_input_duration_us` |Cumulative time requests spend processing inference inputs (in the framework backend, does not include cached requests) |Per model |Per request | +| |Compute Time |`nv_inference_compute_infer_duration_us` |Cumulative time requests spend executing the inference model (in the framework backend, does not include cached requests) |Per model |Per request | +| |Compute Output Time|`nv_inference_compute_output_duration_us` |Cumulative time requests spend processing inference outputs (in the framework backend, does not include cached requests) |Per model |Per request | + +To disable these metrics specifically, you can set `--metrics-config counter_latencies=false` + +#### Summaries + +> **Note** +> +> The following Summary feature is experimental for the time being and may be +> subject to change based on user feedback. + +To get configurable quantiles over a sliding time window, Triton supports +a set a [Summary](https://prometheus.io/docs/concepts/metric_types/#summary) +metrics for latencies as well. These metrics are disabled by default, but can +be enabled by setting `--metrics-config summary_latencies=true`. + +For more information on how the quantiles are calculated, see +[this explanation](https://grafana.com/blog/2022/03/01/how-summary-metrics-work-in-prometheus/). + +The following summary metrics are available: + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|Latency |Request Time |`nv_inference_request_summary_us` |Summary of end-to-end inference request handling times (includes cached requests) |Per model |Per request | +| |Queue Time |`nv_inference_queue_summary_us` |Summary of time requests spend waiting in the scheduling queue (includes cached requests) |Per model |Per request | +| |Compute Input Time|`nv_inference_compute_input_summary_us` |Summary time requests spend processing inference inputs (in the framework backend, does not include cached requests) |Per model |Per request | +| |Compute Time |`nv_inference_compute_infer_summary_us` |Summary of time requests spend executing the inference model (in the framework backend, does not include cached requests) |Per model |Per request | +| |Compute Output Time|`nv_inference_compute_output_summary_us` |Summary of time requests spend processing inference outputs (in the framework backend, does not include cached requests) |Per model |Per request | + +Each summary above is actually composed of several sub-metrics. For each +metric, there is a set of `quantile` metrics tracking the latency for each +quantile. Additionally, there are `_count` and `_sum` metrics that aggregate +the count and observed values for each. 
For example, see the following +information exposed by the Inference Queue Summary metrics: +``` +# HELP nv_inference_queue_summary_us Summary of inference queuing duration in microseconds (includes cached requests) +# TYPE nv_inference_queue_summary_us summary +nv_inference_queue_summary_us_count{model="my_model",version="1"} 161 +nv_inference_queue_summary_us_sum{model="my_model",version="1"} 11110 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.5"} 55 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.9"} 97 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.95"} 98 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.99"} 101 +nv_inference_queue_summary_us{model="my_model",version="1",quantile="0.999"} 101 +``` + +The count and sum for the summary above show that stats have been recorded for +161 requests, and took a combined total of 11110 microseconds. The `_count` and +`_sum` of a summary should generally match the counter metric equivalents when +applicable, such as: +``` +nv_inference_request_success{model="my_model",version="1"} 161 +nv_inference_queue_duration_us{model="my_model",version="1"} 11110 +``` + +Triton has a set of default quantiles to track, as shown above. To set +custom quantiles, you can use the `--metrics-config` CLI option. The format is: +``` +tritonserver --metrics-config summary_quantiles=":,...,:"` +``` + +For example: +``` +tritonserver --metrics-config summary_quantiles="0.5:0.05,0.9:0.01,0.95:0.001,0.99:0.001"` +``` + +To better understand the setting of error values for computing each quantile, see the +[best practices for histograms and summaries](https://prometheus.io/docs/practices/histograms/#histograms-and-summaries). + + +## GPU Metrics + +GPU metrics are collected through the use of [DCGM](https://developer.nvidia.com/dcgm). +Collection of GPU metrics can be toggled with the `--allow-gpu-metrics` CLI flag. +If building Triton locally, the `TRITON_ENABLE_METRICS_GPU` CMake build flag can be used to toggle building the relevant code entirely. + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|----------------|------------------|----------------------------|-------------------------------------------------------|-----------|-------------| +|GPU Utilization |Power Usage |`nv_gpu_power_usage` |GPU instantaneous power, in watts |Per GPU |Per interval | +| |Power Limit |`nv_gpu_power_limit` |Maximum GPU power limit, in watts |Per GPU |Per interval | +| |Energy Consumption|`nv_energy_consumption` |GPU energy consumption since Triton started, in joules |Per GPU |Per interval | +| |GPU Utilization |`nv_gpu_utilization` |GPU utilization rate (0.0 - 1.0) |Per GPU |Per interval | +|GPU Memory |GPU Total Memory |`nv_gpu_memory_total_bytes` |Total GPU memory, in bytes |Per GPU |Per interval | +| |GPU Used Memory |`nv_gpu_memory_used_bytes` |Used GPU memory, in bytes |Per GPU |Per interval | + + +## CPU Metrics + +Collection of CPU metrics can be toggled with the `--allow-cpu-metrics` CLI flag. +If building Triton locally, the `TRITON_ENABLE_METRICS_CPU` CMake build flag can be used to toggle building the relevant code entirely. + +> **Note** +> +> CPU Metrics are currently only supported on Linux. +> They collect information from the [/proc filesystem](https://www.kernel.org/doc/html/latest/filesystems/proc.html) such as `/proc/stat` and `/proc/meminfo`. 
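+
+A quick way to confirm that CPU metrics are being collected is to filter the
+metrics endpoint for the `nv_cpu` families listed in the table below (a minimal
+check, assuming the default metrics port of 8002):
+
+```
+curl -s localhost:8002/metrics | grep "^nv_cpu"
+```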
+ +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|CPU Utilization | CPU Utilization | `nv_cpu_utilization` | Total CPU utilization rate [0.0 - 1.0] | Aggregated across all cores since last interval | Per interval | +|CPU Memory | CPU Total Memory | `nv_cpu_memory_total_bytes` | Total CPU memory (RAM), in bytes | System-wide | Per interval | +| | CPU Used Memory | `nv_cpu_memory_used_bytes` | Used CPU memory (RAM), in bytes | System-wide | Per interval | + +## Response Cache Metrics + +Cache metrics can be reported in two ways: + +1. A base set of cache metrics will be reported +by Triton directly, such as the cache hit/miss counts and durations described +below. + +2. As of 23.03, additional cache metrics may be reported depending on the +[cache implementation](response_cache.md#cache-implementations) +being used through Triton's [Metrics API](#custom-metrics). + +### Triton-reported Response Cache Metrics + +Compute latency metrics in the +[Inference Request Metrics table](#inference-request-metrics) above are +calculated for the time spent in model inference backends. If the response +cache is enabled for a given model (see [Response Cache](response_cache.md) +docs for more info), total inference times may be affected by response cache +lookup times. + +On cache hits, "Cache Hit Time" indicates the time spent looking up the +response, and "Compute Input Time" / "Compute Time" / "Compute Output Time" +are not recorded. + +On cache misses, "Cache Miss Time" indicates the time spent looking up +the request hash and inserting the computed output tensor data into the cache. +Otherwise, "Compute Input Time" / "Compute Time" / "Compute Output Time" will +be recorded as usual. + +|Category |Metric |Metric Name |Description |Granularity|Frequency | +|--------------|----------------|------------|---------------------------|-----------|-------------| +|Count |Cache Hit Count |`nv_cache_num_hits_per_model` |Number of response cache hits per model |Per model |Per request | +| |Cache Miss Count |`nv_cache_num_misses_per_model` |Number of response cache misses per model |Per model |Per request | +|Latency |Cache Hit Time |`nv_cache_hit_duration_per_model` |Cumulative time requests spend retrieving a cached response per model on cache hits (microseconds) |Per model |Per request | +| |Cache Miss Time |`nv_cache_miss_duration_per_model` |Cumulative time requests spend looking up and inserting responses into the cache on a cache miss (microseconds) |Per model |Per request | + +Similar to the Summaries section above for Inference Request Metrics, the +per-model cache hit/miss latency metrics also support Summaries. + +> **Note** +> +> For models with response caching enabled, the inference request **summary** metric +> is currently disabled. This is due to extra time spent internally on cache +> management that wouldn't be reflected correctly in the end to end request time. +> Other summary metrics are unaffected. + +## Custom Metrics + +Triton exposes a C API to allow users and backends to register and collect +custom metrics with the existing Triton metrics endpoint. The user takes the +ownership of the custom metrics created through the APIs and must manage their +lifetime following the API documentation. 
+ +The +[identity_backend](https://github.com/triton-inference-server/identity_backend/blob/main/README.md#custom-metric-example) +demonstrates a practical example of adding a custom metric to a backend. + +Further documentation can be found in the `TRITONSERVER_MetricFamily*` and +`TRITONSERVER_Metric*` API annotations in +[tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). diff --git a/docs/user_guide/model_analyzer.md b/docs/user_guide/model_analyzer.md new file mode 100644 index 0000000000..663a8a277a --- /dev/null +++ b/docs/user_guide/model_analyzer.md @@ -0,0 +1,45 @@ + + +# Model Analyzer + +The Triton [Model Analyzer](https://github.com/triton-inference-server/model_analyzer) + is a tool that uses +[Performance Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) +to send requests to your model while measuring GPU memory and compute +utilization. The Model Analyzer is specifically useful for characterizing the +GPU memory requirements for your model under different batching and model +instance configurations. Once you have this GPU memory usage information you can +more intelligently decide on how to combine multiple models on the same GPU +while remaining within the memory capacity of the GPU. + +For more detailed examples and explanations of using Model Analyzer, see: +- [Model Analyzer Conceptual Guide](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_3-optimizing_triton_configuration) +- [Maximizing Deep Learning +Inference Performance with NVIDIA Model +Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer) \ No newline at end of file diff --git a/docs/model_configuration.md b/docs/user_guide/model_configuration.md similarity index 76% rename from docs/model_configuration.md rename to docs/user_guide/model_configuration.md index e1062186d2..241301ade7 100644 --- a/docs/model_configuration.md +++ b/docs/user_guide/model_configuration.md @@ -1,5 +1,5 @@ + +Perf Analyzer documentation has been relocated to +[here](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md). diff --git a/docs/user_guide/performance_tuning.md b/docs/user_guide/performance_tuning.md new file mode 100644 index 0000000000..e28789a2d3 --- /dev/null +++ b/docs/user_guide/performance_tuning.md @@ -0,0 +1,393 @@ + + +# Deploying your trained model using Triton + +Given a trained model, how do I deploy it at-scale with an optimal configuration +using Triton Inference Server? This document is here to help answer that. + +For those who like a [high level overview](#overview), below is the common flow +for most use cases. + +For those who wish to jump right in, skip to the +[end-to-end example](#end-to-end-example). + +For additional material, see the +[Triton Conceptual Guide tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_4-inference_acceleration). + +## Overview + +1. Is my model compatible with Triton? + - If your model falls under one of Triton's + [supported backends](https://github.com/triton-inference-server/backend), + then we can simply try to deploy the model as described in the + [Quickstart](../getting_started/quickstart.md) guide. 
+ For the ONNXRuntime, TensorFlow SavedModel, and TensorRT backends, the + minimal model configuration can be inferred from the model using Triton's + [AutoComplete](model_configuration.md#auto-generated-model-configuration) + feature. + This means that a `config.pbtxt` may still be provided, but is not required + unless you want to explicitly set certain parameters. + Additionally, by enabling verbose logging via `--log-verbose=1`, you can see + the complete config that Triton sees internally in the server log output. + For other backends, refer to the + [Minimal Model Configuration](model_configuration.md#minimal-model-configuration) + required to get started. + - If your model does not come from a supported backend, you can look into + the [Python Backend](https://github.com/triton-inference-server/python_backend) + or writing a + [Custom C++ Backend](https://github.com/triton-inference-server/backend/blob/main/examples/README.md) + to support your model. The Python Backend provides a simple interface to + execute requests through a generic python script, but may not be as + performant as a Custom C++ Backend. Depending on your use case, the Python + Backend performance may be a sufficient tradeoff for the simplicity of + implementation. + +2. Can I run inference on my served model? + - Assuming you were able to load your model on Triton, the next step is to + verify that we can run inference requests and get a baseline performance + benchmark of your model. + Triton's + [Perf Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) + tool specifically fits this purpose. Here is a simplified output for + demonstration purposes: + + ``` + # NOTE: "my_model" represents a model currently being served by Triton + $ perf_analyzer -m my_model + ... + + Inferences/Second vs. Client Average Batch Latency + Concurrency: 1, throughput: 482.8 infer/sec, latency 12613 usec + ``` + + - This gives us a sanity test that we are able to successfully form input + requests and receive output responses to communicate with the model backend + via Triton APIs. + - If Perf Analyzer fails to send requests and it is unclear from the error + how to proceed, then you may want to sanity check that your model + `config.pbtxt` inputs/outputs match what the model expects. If the config + is correct, check that the model runs successfully using its original + framework directly. If you don't have your own script or tool to do so, + [Polygraphy](https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy) + is a useful tool to run sample inferences on your model via various + frameworks. Currently, Polygraphy supports ONNXRuntime, TensorRT, and + TensorFlow 1.x. + - The definition of "performing well" is subject to change for each use + case. Some common metrics are throughput, latency, and GPU utilization. + There are many variables that can be tweaked just within your model + configuration (`config.pbtxt`) to obtain different results. + - As your model, config, or use case evolves, + [Perf Analyzer](https://github.com/triton-inference-server/client/blob/main/src/c++/perf_analyzer/README.md) + is a great tool to quickly verify model functionality and performance. + +3. How can I improve my model performance? + - To further understand the best model configuration you can provide to + Triton for your use case, Triton's + [Model Analyzer](https://github.com/triton-inference-server/model_analyzer) + tool can help. 
+ Model Analyzer can automatically or + [manually](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md) + search through config combinations to find the optimal triton configuration + to meet your constraints. After running Model Analyzer to find the optimal + configurations for your model/use case, you can transfer the generated + config files to your [Model Repository](model_repository.md). + Model Analyzer provides a + [Quickstart](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/quick_start.md) + guide with some examples to walk through. + - Upon serving the model with the newly optimized configuration file found + by Model Analyzer and running Perf Analyzer again, you should expect to find + better performance numbers in most cases compared to a default config. + - Some parameters that can be tuned for a model may not be exposed to Model + Analyzer's automatic search since they don't apply to all models. + For instance, [backends](https://github.com/triton-inference-server/backend) + can expose backend-specific configuration options that can be tuned as well. + The [ONNXRuntime + Backend](https://github.com/triton-inference-server/onnxruntime_backend), + for example, has several + [parameters](https://github.com/triton-inference-server/onnxruntime_backend#model-config-options) + that affect the level of parallelization when executing inference on a + model. + These backend-specific options may be worth investigating if the defaults + are not providing sufficient performance. To tune custom sets of + parameters, Model Analyzer supports + [Manual Configuration Search](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/config_search.md). + - To learn more about further optimizations for your model configuration, + see the [Optimization](optimization.md) docs. + +### Other Areas of Interest + +1. My model performs slowly when it is first loaded by Triton +(cold-start penalty), what do I do? + - Triton exposes the ability to run + [ModelWarmup](model_configuration.md#model-warmup) requests when first + loading the model to ensure that the model is sufficiently warmed up before + being marked "READY" for inference. + +2. Why doesn't my model perform significantly faster on GPU? + - Most official backends supported by Triton are optimized for GPU inference + and should perform well on GPU out of the box. + - Triton exposes options for you to optimize your model further on the GPU. + Triton's + [Framework Specific Optimizations](optimization.md#framework-specific-optimization) + goes into further detail on this topic. + - Complete conversion of your model to a backend fully optimized for GPU + inference such as [TensorRT](https://developer.nvidia.com/tensorrt) may + provide even better results. + You may find more Triton-specific details about TensorRT in the + [TensorRT Backend](https://github.com/triton-inference-server/tensorrt_backend). + - If none of the above can help get sufficient GPU-accelerated performance + for your model, the model may simply be better designed for CPU execution + and the [OpenVINO Backend](https://github.com/triton-inference-server/openvino_backend) may + help further optimize your CPU execution. + +## End-to-end Example + +> **Note** +> If you have never worked with Triton before, you may be interested in first +checking out the [Quickstart](../getting_started/quickstart.md) example. 
+> Some basic understanding of Triton may be useful for the following section, +but this example is meant to be straightforward enough without prior experience. + +Let's take an ONNX model as our example since ONNX is designed to be a format +that can be [easily +exported](https://github.com/onnx/tutorials#converting-to-onnx-format) from most +other frameworks. + +1. Create a [Model Repository](model_repository.md) and download our example +`densenet_onnx` model into it. + +```bash +# Create model repository with placeholder for model and version 1 +mkdir -p ./models/densenet_onnx/1 + +# Download model and place it in model repository +wget -O models/densenet_onnx/1/model.onnx +https://contentmamluswest001.blob.core.windows.net/content/14b2744cf8d6418c87ffddc3f3127242/9502630827244d60a1214f250e3bbca7/08aed7327d694b8dbaee2c97b8d0fcba/densenet121-1.2.onnx +``` + +2. Create a minimal [Model Configuration](model_configuration.md) for the +`densenet_onnx` model in our [Model Repository](model_repository.md) at +`./models/densenet_onnx/config.pbtxt`. + +> **Note** +> This is a slightly simplified version of another [example +config](../examples/model_repository/densenet_onnx/config.pbtxt) that utilizes +other [Model Configuration](model_configuration.md) features not necessary for +this example. + +```protobuf +name: "densenet_onnx" +backend: "onnxruntime" +max_batch_size: 0 +input: [ + { + name: "data_0", + data_type: TYPE_FP32, + dims: [ 1, 3, 224, 224] + } +] +output: [ + { + name: "prob_1", + data_type: TYPE_FP32, + dims: [ 1, 1000, 1, 1 ] + } +] +``` + +> **Note** +> As of the 22.07 release, both Triton and Model Analyzer support fully +auto-completing the config file for +[backends that support it](model_configuration.md#auto-generated-model-configuration). +> So for an ONNX model, for example, this step can be skipped unless you want to +explicitly set certain parameters. + +3. Start the server container + +To serve our model, we will use the server container which comes pre-installed +with a `tritonserver` binary. + +```bash +# Start server container +docker run -ti --rm --gpus=all --network=host -v $PWD:/mnt --name triton-server nvcr.io/nvidia/tritonserver:23.11-py3 + +# Start serving your models +tritonserver --model-repository=/mnt/models +``` + +> **Note** +> The `-v $PWD:/mnt` is mounting your current directory on the host into the +`/mnt` directory inside the container. +> So if you created your model repository in `$PWD/models`, you will find it +inside the container at `/mnt/models`. +> You can change these paths as needed. See +[docker volume](https://docs.docker.com/storage/volumes/) docs for more information on +how this works. + + +To check if the model loaded successfully, we expect to see our model in a +`READY` state in the output of the previous command: + +``` +... +I0802 18:11:47.100537 135 model_repository_manager.cc:1345] successfully loaded 'densenet_onnx' version 1 +... ++---------------+---------+--------+ +| Model | Version | Status | ++---------------+---------+--------+ +| densenet_onnx | 1 | READY | ++---------------+---------+--------+ +... +``` + +4. Verify the model can run inference + +To verify our model can perform inference, we will use the `triton-client` +container that we already started which comes with `perf_analyzer` +pre-installed. + +In a separate shell, we use Perf Analyzer to sanity check that we can run +inference and get a baseline for the kind of performance we expect from this +model. 
+ +In the example below, Perf Analyzer is sending requests to models served on the +same machine (`localhost` from the server container via `--network=host`). +However, you may also test models being served remotely at some `:` +by setting the `-u` flag, such as `perf_analyzer -m densenet_onnx -u +127.0.0.1:8000`. + +```bash +# Start the SDK container interactively +docker run -ti --rm --gpus=all --network=host -v $PWD:/mnt --name triton-client nvcr.io/nvidia/tritonserver:23.11-py3-sdk + +# Benchmark model being served from step 3 +perf_analyzer -m densenet_onnx --concurrency-range 1:4 +``` + +``` +... +Inferences/Second vs. Client Average Batch Latency +Concurrency: 1, throughput: 265.147 infer/sec, latency 3769 usec +Concurrency: 2, throughput: 890.793 infer/sec, latency 2243 usec +Concurrency: 3, throughput: 937.036 infer/sec, latency 3199 usec +Concurrency: 4, throughput: 965.21 infer/sec, latency 4142 usec +``` + +5. Run Model Analyzer to find the best configurations for our model + +While Model Analyzer comes pre-installed in the SDK (client) container and +supports various modes of connecting to a Triton server, for simplicity we will +use install Model Analyzer in our `server` container to use the `local` +(default) mode. +To learn more about other methods of connecting Model Analyzer to a running +Triton Server, see the `--triton-launch-mode` Model Analyzer flag. + +```bash +# Enter server container interactively +docker exec -ti triton-server bash + +# Stop existing tritonserver process if still running +# because model-analyzer will start its own server +SERVER_PID=`ps | grep tritonserver | awk '{ printf $1 }'` +kill ${SERVER_PID} + +# Install model analyzer +pip install --upgrade pip +pip install triton-model-analyzer wkhtmltopdf + +# Profile the model using local (default) mode +# NOTE: This may take some time, in this example it took ~10 minutes +model-analyzer profile \ + --model-repository=/mnt/models \ + --profile-models=densenet_onnx \ + --output-model-repository-path=results + +# Summarize the profiling results +model-analyzer analyze --analysis-models=densenet_onnx +``` + +Example Model Analyzer output summary: + +> In 51 measurements across 6 configurations, `densenet_onnx_config_3` provides +the best throughput: **323 infer/sec**. +> +> **This is a 92% gain over the default configuration (168 infer/sec), under the +given constraints.** + +| Model Config Name | Max Batch Size | Dynamic Batching | Instance Count | p99 Latency (ms) | Throughput (infer/sec) | Max GPU Memory Usage (MB) | Average GPU Utilization (%) | +|---|---|---|---|---|---|---|---| +| densenet_onnx_config_3 | 0 | Enabled | 4/GPU | 35.8 | 323.13 | 3695 | 58.6 | +| densenet_onnx_config_2 | 0 | Enabled | 3/GPU | 59.575 | 295.82 | 3615 | 58.9 | +| densenet_onnx_config_4 | 0 | Enabled | 5/GPU | 69.939 | 291.468 | 3966 | 58.2 | +| densenet_onnx_config_default | 0 | Disabled | 1/GPU | 12.658 | 167.549 | 3116 | 51.3 | + +In the table above, we see that setting our GPU [Instance +Count](model_configuration.md#instance-groups) to 4 allows us to achieve the +highest throughput and almost lowest latency on this system. + +Also, note that this `densenet_onnx` model has a fixed batch-size that is +explicitly specified in the first dimension of the Input/Output `dims`, +therefore the `max_batch_size` parameter is set to 0 as described +[here](model_configuration.md#maximum-batch-size). +For models that support dynamic batch size, Model Analyzer would also tune the +`max_batch_size` parameter. 
+ +> **Warning** +> These results are specific to the system running the Triton server, so for +example, on a smaller GPU we may not see improvement from increasing the GPU +instance count. +> In general, running the same configuration on systems with different hardware +(CPU, GPU, RAM, etc.) may provide different results, so it is important to +profile your model on a system that accurately reflects where you will deploy +your models for your use case. + +6. Extract optimal config from Model Analyzer results + +In our example above, `densenet_onnx_config_3` was the optimal configuration. +So let's extract that `config.pbtxt` and put it back in our model repository for future use. + +```bash +# (optional) Backup our original config.pbtxt (if any) to another directory +cp /mnt/models/densenet_onnx/config.pbtxt /tmp/original_config.pbtxt + +# Copy over the optimal config.pbtxt from Model Analyzer results to our model repository +cp ./results/densenet_onnx_config_3/config.pbtxt /mnt/models/densenet_onnx/ +``` + +Now that we have an optimized Model Configuration, we are ready to take our +model to deployment. For further manual tuning, read the [Model +Configuration](model_configuration.md) and [Optimization](optimization.md) docs +to learn more about Triton's complete set of capabilities. + +In this example, we happened to get both the highest throughput and almost +lowest latency from the same configuration, but in some cases this is a tradeoff +that must be made. Certain models or configurations may achieve a higher +throughput but also incur a higher latency in return. It is worthwhile to fully +inspect the reports generated by Model Analyzer to ensure your model performance +meets your requirements. diff --git a/docs/ragged_batching.md b/docs/user_guide/ragged_batching.md similarity index 97% rename from docs/ragged_batching.md rename to docs/user_guide/ragged_batching.md index 3e69beb912..308b75fa57 100644 --- a/docs/ragged_batching.md +++ b/docs/user_guide/ragged_batching.md @@ -57,12 +57,13 @@ How ragged input are processed in a batch of requests depends on the backend implementation. The backends, such as [ONNX Runtime backend](https://github.com/triton-inference-server/onnxruntime_backend), [TensorFlow backend](https://github.com/triton-inference-server/tensorflow_backend), +[PyTorch backend](https://github.com/triton-inference-server/pytorch_backend), and [TensorRT backend](https://github.com/triton-inference-server/tensorrt_backend), require models to accept ragged inputs as 1-dimensional tensors. These backends concatenates the request inputs into the 1-dimensional tensor. Because the concatenated input doesn't track the start and end index for each -request, the backends also require the model to have additional input(s), +request, the backends often require the model to have additional input(s), [batch input](#batch-input), that describe various information about the batch formed. diff --git a/docs/rate_limiter.md b/docs/user_guide/rate_limiter.md similarity index 98% rename from docs/rate_limiter.md rename to docs/user_guide/rate_limiter.md index 2e38327042..69b94fd8b8 100644 --- a/docs/rate_limiter.md +++ b/docs/user_guide/rate_limiter.md @@ -42,9 +42,9 @@ frameworks dynamically allocate memory. Running all such models simultaneously may lead to system going out-of-memory. Rate limiter allows to postpone the inference execution on some -model instances such that not all of them runs simultaneously. +model instances such that not all of them runs simultaneously. 
The model priorities are used to decide which model instance -to schedule next. +to schedule next. ## Using Rate Limiter diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md new file mode 100644 index 0000000000..8db4e3b8c1 --- /dev/null +++ b/docs/user_guide/request_cancellation.md @@ -0,0 +1,102 @@ + + +# Request Cancellation + +Starting from r23.10, Triton supports handling request cancellation received +from the gRPC client or a C API user. Long running inference requests such +as for auto generative large language models may run for an indeterminate +amount of time or indeterminate number of steps. Additionally clients may +enqueue a large number of requests as part of a sequence or request stream +and later determine the results are no longer needed. Continuing to process +requests whose results are no longer required can significantly impact server +resources. + +## Issuing Request Cancellation + +### In-Process C API + +[In-Process Triton Server C API](../customization_guide/inference_protocols.md#in-process-triton-server-api) has been enhanced with `TRITONSERVER_InferenceRequestCancel` +and `TRITONSERVER_InferenceRequestIsCancelled` to issue cancellation and query +whether cancellation has been issued on an inflight request respectively. Read more +about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + + +### gRPC Endpoint + +In addition, [gRPC endpoint](../customization_guide/inference_protocols.md#httprest-and-grpc-protocols) can +now detect cancellation from the client and attempt to terminate request. +At present, only gRPC python client supports issuing request cancellation +to the server endpoint. See [request-cancellation](https://github.com/triton-inference-server/client#request-cancellation) +for more details on how to issue requests from the client-side. +See gRPC guide on RPC [cancellation](https://grpc.io/docs/guides/cancellation/) for +finer details. + +## Handling in Triton Core + +Triton core checks for requests that have been cancelled at some critical points +when using [dynamic](./model_configuration.md#dynamic-batcher) or +[sequence](./model_configuration.md#sequence-batcher) batching. The checking is +also performed between each +[ensemble](./model_configuration.md#ensemble-scheduler) steps and terminates +further processing if the request is cancelled. + +On detecting a cancelled request, Triton core responds with CANCELLED status. If a request +is cancelled when using [sequence_batching](./model_configuration.md#sequence-batcher), +then all the pending requests in the same sequence will also be cancelled. The sequence +is represented by the requests that has identical sequence id. + +**Note**: Currently, Triton core does not detect cancellation status of a request once +it is forwarded to [rate limiter](./rate_limiter.md). Improving the request cancellation +detection and handling within Triton core is work in progress. + +## Handling in Backend + +Upon receiving request cancellation, Triton does its best to terminate request +at various points. However, once a request has been given to the backend +for execution, it is up to the individual backends to detect and handle +request termination. 
+Currently, the following backends support early termination: +- [TensorRT-LLM backend](https://github.com/triton-inference-server/tensorrtllm_backend) +- [vLLM backend](https://github.com/triton-inference-server/vllm_backend) +- [python backend](https://github.com/triton-inference-server/python_backend) + +Python backend is a special case where we expose the APIs to detect cancellation +status of the request but it is up to the `model.py` developer to detect whether +the request is cancelled and terminate further execution. + +**For the backend developer**: The backend APIs have also been enhanced to let the +backend detect whether the request received from Triton core has been cancelled. +See `TRITONBACKEND_RequestIsCancelled` and `TRITONBACKEND_ResponseFactoryIsCancelled` +in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) +for more details. The backend upon detecting request cancellation can stop processing +it any further. +The Python models running behind Python backend can also query the cancellation status +of request and response_sender. See [this](https://github.com/triton-inference-server/python_backend#request-cancellation-handling) +section in python backend documentation for more details. + diff --git a/docs/user_guide/response_cache.md b/docs/user_guide/response_cache.md new file mode 100644 index 0000000000..e70085e798 --- /dev/null +++ b/docs/user_guide/response_cache.md @@ -0,0 +1,243 @@ + + +# Triton Response Cache + +## Overview + +In this document an *inference request* is the model name, model version, and +input tensors (name, shape, datatype and tensor data) that make up a request +submitted to Triton. An inference result is the output tensors (name, shape, +datatype and tensor data) produced by an inference execution. The response cache +is used by Triton to hold inference results generated for previous executed +inference requests. Triton will maintain the response cache so that inference +requests that hit in the cache will not need to execute a model to produce +results and will instead extract their results from the cache. For some use +cases this can significantly reduce the inference request latency. + +Triton accesses the response cache with a hash of the inference request that +includes the model name, model version and model inputs. If the hash is found in +the cache, the corresponding inference result is extracted from the cache and +used for the request. When this happens there is no need for Triton to execute +the model to produce the inference result. If the hash is not found in the +cache, Triton executes the model to produce the inference result, and then +records that result in the cache so that subsequent inference requests can +(re)use those results. + +## Usage + +In order for caching to be used on a given model, it must be enabled +on both the server-side, and in the model's +[model config](model_configuration.md#response-cache). See the following +sections below for more details. + +### Enable Caching on Server-side + +The response cache is enabled on the server-side by specifying a +`` and corresponding configuration when starting +the Triton server. + +Through the CLI, this translates to setting +`tritonserver --cache-config ,= ...`. For example: +``` +tritonserver --cache-config local,size=1048576 +``` + +For in-process C API applications, this translates to calling +`TRITONSERVER_SetCacheConfig(const char* cache_implementation, const char* config_json)`. 
+ +This allows users to enable/disable caching globally on server startup. + +### Enable Caching for a Model + +**By default, no model uses response caching even if the response cache +is enabled globally with the `--cache-config` flag.** + +For a given model to use response caching, the model must also have +response caching enabled in its model configuration: +``` +# config.pbtxt + +response_cache { + enable: true +} +``` + +This allows users to enable/disable caching for specific models. + +For more information on enabling the response cache for each model, see the +[model configuration docs](model_configuration.md#response-cache). + +### Cache Implementations + +Starting in the 23.03 release, Triton has a set of +[TRITONCACHE APIs](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritoncache.h) +that are used to communicate with a cache implementation of the user's choice. + +A cache implementation is a shared library that implements the required +TRITONCACHE APIs and is dynamically loaded on server startup, if enabled. + +Triton's most recent +[tritonserver release containers](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver) +come with the following cache implementations out of the box: +- [local](https://github.com/triton-inference-server/local_cache): `/opt/tritonserver/caches/local/libtritoncache_local.so` +- [redis](https://github.com/triton-inference-server/redis_cache): `/opt/tritonserver/caches/redis/libtritoncache_redis.so` + +With these TRITONCACHE APIs, `tritonserver` exposes a new `--cache-config` +CLI flag that gives the user flexible customization of which cache implementation +to use, and how to configure it. Similar to the `--backend-config` flag, +the expected format is `--cache-config ,=` and may +be specified multiple times to specify multiple keys if the cache implementation +requires it. + +#### Local Cache + +The `local` cache implementation is equivalent to the response cache used +internally before the 23.03 release. For more implementation specific details, +see the +[local cache implementation](https://github.com/triton-inference-server/local_cache). + +When `--cache-config local,size=SIZE` is specified with a non-zero `SIZE`, +Triton allocates the requested size in CPU memory and **shares the +cache across all inference requests and across all models**. + +#### Redis Cache + +The `redis` cache implementation exposes the ability for Triton to communicate +with a Redis server for caching. The `redis_cache` implementation is essentially +a Redis client that acts as an intermediary between Triton and Redis. + +To list a few benefits of the `redis` cache compared to the `local` cache in +the context of Triton: +- The Redis server can be hosted remotely as long as it is accessible by Triton, + so it is not tied directly to the Triton process lifetime. + - This means Triton can be restarted and still have access to previously cached entries. + - This also means that Triton doesn't have to compete with the cache for memory/resource usage. +- Multiple Triton instances can share a cache by configuring each Triton instance + to communicate with the same Redis server. +- The Redis server can be updated/restarted independently of Triton, and + Triton will fallback to operating as it would with no cache access during + any Redis server downtime, and log appropriate errors. 
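+
+As a hedged example, a Triton launch that points the `redis` cache at a running
+Redis server might look like the following. The `host` and `port` key names are
+taken from the redis cache implementation linked above and the hostname is a
+placeholder; consult that implementation's documentation for the authoritative
+list of supported keys.
+
+```
+tritonserver --model-repository=/models \
+    --cache-config redis,host=redis-cache.example.com \
+    --cache-config redis,port=6379
+```
+
+Each additional `--cache-config redis,<key>=<value>` pair is forwarded to the
+cache implementation, so any further connection or tuning settings it supports
+can be supplied in the same way.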
+ +In general, the Redis server can be configured/deployed as needed for your use +case, and Triton's `redis` cache will simply act as a client of your Redis +deployment. The [Redis docs](https://redis.io/docs/) should be consulted for +questions and details about configuring the Redis server. + +For Triton-specific `redis` cache implementation details/configuration, see the +[redis cache implementation](https://github.com/triton-inference-server/redis_cache). + +#### Custom Cache + +With the TRITONCACHE API interface, it is now possible for +users to implement their own cache to suit any use-case specific needs. +To see the required interface that must be implemented by a cache +developer, see the +[TRITONCACHE API header](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritoncache.h). +The `local` or `redis` cache implementations may be used as reference. + +Upon successfully developing and building a custom cache, the resulting shared +library (ex: `libtritoncache_.so`) must be placed in the cache directory +similar to where the `local` and `redis` cache implementations live. By default, +this directory is `/opt/tritonserver/caches`, but a custom directory may be +specified with `--cache-dir` as needed. + +To put this example together, if the custom cache were named "custom" +(this name is arbitrary), by default Triton would expect to find the +cache implementation at `/opt/tritonserver/caches/custom/libtritoncache_custom.so`. + +## Deprecation Notes + +> **Note** +> Prior to 23.03, enabling the `local` cache used to be done through setting a non-zero size +> (in bytes) when Triton was launched using the `--response-cache-byte-size` flag. +> +> Starting in 23.03, the `--response-cache-byte-size` flag is now deprecated and +> `--cache-config` should be used instead. For backwards compatibility, +> `--response-cache-byte-size` will continue to function under the hood by being +> converted to the corresponding `--cache-config` argument, but it will default +> to using the `local` cache implementation. It is not possible to choose other +> cache implementations using the `--response-cache-byte-size` flag. +> +> For example, `--response-cache-byte-size 1048576` +> would be equivalent to `--cache-config local,size=1048576`. However, the +> `--cache-config` flag is much more flexible and should be used instead. + +> **Warning** +> +> The `local` cache implementation may fail to initialize for very small values +> of `--cache-config local,size=` or `--response-cache-byte-size` +> (ex: less than 1024 bytes) due to internal memory management requirements. +> If you encounter an initialization error for a relatively small cache size, +> try increasing it. +> +> Similarly, the size is upper bounded by the available RAM on the system. +> If you encounter an initial allocation error for a very large cache size +> setting, try decreasing it. + +## Performance + +The response cache is intended to be used for use cases where a significant +number of duplicate requests (cache hits) are expected and therefore would +benefit from caching. The term "significant" here is subjective to the use +case, but a simple interpretation would be to consider the proportion of +expected cache hits/misses, as well as the average time spend computing +a response. + +For cases where cache hits are common and computation is expensive, +the cache can significantly improve overall performance. 
+ +For cases where most requests are unique (cache misses) or the compute is +fast/cheap (the model is not compute-bound), the cache can negatively impact +the overall performance due to the overhead of managing and communicating with +the cache. + +## Known Limitations + +- Only input tensors located in CPU memory will be hashable for accessing the + cache. If an inference request contains input tensors not in CPU memory, the + request will not be hashed and therefore the response will not be cached. +- Only responses with all output tensors located in CPU memory will be eligible + for caching. If any output tensor in a response is not located in CPU memory, + the response will not be cached. +- The cache is accessed using only the inference request hash. As a result, if + two different inference requests generate the same hash (a hash collision), + then Triton may incorrectly use the cached result for an inference request. + The hash is a 64-bit value so the likelihood of collision is small. +- Only successful inference requests will have their responses cached. If a + request fails or returns an error during inference, its response will not be + cached. +- Only requests going through the Default Scheduler or Dynamic Batch Scheduler + are eligible for caching. The Sequence Batcher does not currently support + response caching. +- The response cache does not currently support + [decoupled models](decoupled_models.md). +- Top-level requests to ensemble models do not currently support response + caching. However, composing models within an ensemble may have their + responses cached if supported and enabled by that composing model. + diff --git a/docs/user_guide/trace.md b/docs/user_guide/trace.md new file mode 100644 index 0000000000..23d1c402d1 --- /dev/null +++ b/docs/user_guide/trace.md @@ -0,0 +1,539 @@ + + +# Triton Server Trace + +Triton includes that capability to generate a detailed trace for +individual inference requests. Tracing is enable by command-line +arguments when running the tritonserver executable. + +`--trace-config` command line option in Triton can be used to specify +global and trace mode specific config setting. The format of this flag +is `--trace-config ,=`, where `` +is either `triton` or `opentelemetry`. By default, the trace mode is set to `triton`, +and the server will use Triton's trace APIs. For `opentelemetry` mode, +the server will use the [OpenTelemetry's APIs](#opentelemetry-trace-support) to generate, +collect and export traces for individual inference requests. + +To specify global trace settings (level, rate, count, or mode), +the format is `--trace-config =`. + +An example usage, which invokes Triton's trace APIs: + +``` +$ tritonserver \ + --trace-config triton,file=/tmp/trace.json \ + --trace-config triton,log-frequency=50 \ + --trace-config rate=100 \ + --trace-config level=TIMESTAMPS \ + --trace-config count=100 ... +``` + +## Trace Settings +### Global Settings +The following table shows available global trace settings to pass to `--trace-config` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
SettingDefault ValueDescription
rate1000 + Specifies the sampling rate. The same as deprecated + --trace-rate.
+ For example, a value of 1000 specifies that every 1000-th inference
+ request will be traced. +
levelOFF + Indicates the level of trace detail that should be collected and
+ may be specified multiple times to trace multiple kinds of information.
+ The same as deprecated --trace-level.
+ Choices are TIMESTAMPS and TENSORS.
+ Note that opentelemetry mode does not currently
+ support TENSORS level. +
count-1 + Specifies the remaining number of traces to be collected.
+ The default value of -1 means traces are collected indefinitely.
+ With a value of 100, Triton will stop tracing requests
+ after 100 traces are collected.
+ The same as deprecated --trace-count. +
modetriton + Specifies which trace APIs to use for collecting traces.
+ The choices are triton or opentelemetry.
+
+ +### Triton Trace APIs Settings + +The following table shows available Triton trace APIs settings for +`--trace-config triton,=`. + + + + + + + + + + + + + + + + + + + + +
SettingDefault ValueDescription
fileempty string + Indicates where the trace output should be written.
+ The same as deprecated --trace-file.
+
log-frequency0 + Specifies the rate that the traces are written to file.
+ For example, a value of 50 specifies that Triton will log
+ to file for every 50 traces collected.
+ The same as deprecated --trace-log-frequency.
+
+ +In addition to the trace configuration settings in the command line, you can +modify the trace configuration using the [trace +protocol](../protocol/extension_trace.md). This option is currently not supported, +when trace mode is set to `opentelemetry`. + +**Note**: the following flags are **deprecated**: + +The `--trace-file` option indicates where the trace output should be +written. The `--trace-rate` option specifies the sampling rate. In +this example every 100-th inference request will be traced. The +`--trace-level` option indicates the level of trace detail that should +be collected. `--trace-level` option may be specified multiple times to +trace multiple information. The `--trace-log-frequency` option specifies the +rate that the traces are written to file. In this example Triton will log to +file for every 50 traces collected. The `--trace-count` option specifies the +remaining number of traces to be collected. In this example Triton will stop +tracing more requests after 100 traces are collected. Use the `--help` option +to get more information. + +## Supported Trace Level Option + +- `TIMESTAMPS`: Tracing execution timestamps of each request. +- `TENSORS`: Tracing input and output tensors during the execution. + +## JSON Trace Output + +The trace output is a JSON file with the following schema. + +``` +[ + { + "model_name": $string, + "model_version": $number, + "id": $number, + "request_id": $string, + "parent_id": $number + }, + { + "id": $number, + "timestamps": [ + { "name" : $string, "ns" : $number } + ] + }, + { + "id": $number + "activity": $string, + "tensor":{ + "name": $string, + "data": $string, + "shape": $string, + "dtype": $string + } + }, + ... +] +``` + +Each trace is assigned a "id", which indicates the model name and +version of the inference request. If the trace is from a +model run as part of an ensemble, the "parent_id" will indicate the +"id" of the containing ensemble. +For example: +``` +[ + { + "id": 1, + "model_name": "simple", + "model_version": 1 + }, + ... +] +``` + +Each `TIMESTAMPS` trace will have one or more "timestamps" with +each timestamp having a name and the timestamp in nanoseconds ("ns"). +For example: + +``` +[ + {"id": 1, "timestamps": [{ "name": "HTTP_RECV_START", "ns": 2356425054587444 }] }, + {"id": 1, "timestamps": [{ "name": "HTTP_RECV_END", "ns": 2356425054632308 }] }, + {"id": 1, "timestamps": [{ "name": "REQUEST_START", "ns": 2356425054785863 }] }, + {"id": 1, "timestamps": [{ "name": "QUEUE_START", "ns": 2356425054791517 }] }, + {"id": 1, "timestamps": [{ "name": "INFER_RESPONSE_COMPLETE", "ns": 2356425057587919 }] }, + {"id": 1, "timestamps": [{ "name": "COMPUTE_START", "ns": 2356425054887198 }] }, + {"id": 1, "timestamps": [{ "name": "COMPUTE_INPUT_END", "ns": 2356425057152908 }] }, + {"id": 1, "timestamps": [{ "name": "COMPUTE_OUTPUT_START", "ns": 2356425057497763 }] }, + {"id": 1, "timestamps": [{ "name": "COMPUTE_END", "ns": 2356425057540989 }] }, + {"id": 1, "timestamps": [{ "name": "REQUEST_END", "ns": 2356425057643164 }] }, + {"id": 1, "timestamps": [{ "name": "HTTP_SEND_START", "ns": 2356425057681578 }] }, + {"id": 1, "timestamps": [{ "name": "HTTP_SEND_END", "ns": 2356425057712991 }] } +] +``` + +Each `TENSORS` trace will contain an "activity" and a "tensor". +"activity" indicates the type of tensor, including "TENSOR_QUEUE_INPUT" +and "TENSOR_BACKEND_OUTPUT" by now. "tensor" has the detail of tensor, +including its "name", "data" and "dtype". 
For example: + +``` +[ + { + "id": 1, + "activity": "TENSOR_QUEUE_INPUT", + "tensor":{ + "name": "input", + "data": "0.1,0.1,0.1,...", + "shape": "1,16", + "dtype": "FP32" + } + } +] +``` + +## Trace Summary Tool + +An example [trace summary tool](https://github.com/triton-inference-server/server/blob/main/qa/common/trace_summary.py) can be +used to summarize a set of traces collected from Triton. Basic usage +is: + +``` +$ trace_summary.py +``` + +This produces a summary report for all traces in the file. HTTP and +GRPC inference requests are reported separately. + +``` +File: trace.json +Summary for simple (-1): trace count = 1 +HTTP infer request (avg): 378us + Receive (avg): 21us + Send (avg): 7us + Overhead (avg): 79us + Handler (avg): 269us + Overhead (avg): 11us + Queue (avg): 15us + Compute (avg): 242us + Input (avg): 18us + Infer (avg): 208us + Output (avg): 15us +Summary for simple (-1): trace count = 1 +GRPC infer request (avg): 21441us + Wait/Read (avg): 20923us + Send (avg): 74us + Overhead (avg): 46us + Handler (avg): 395us + Overhead (avg): 16us + Queue (avg): 47us + Compute (avg): 331us + Input (avg): 30us + Infer (avg): 286us + Output (avg): 14us +``` + +Use the -t option to get a summary for each trace in the file. This +summary shows the time, in microseconds, between different points in +the processing of an inference request. For example, the below output +shows that it took 15us from the start of handling the request until +the request was enqueued in the scheduling queue. + +``` +$ trace_summary.py -t +... +simple (-1): + grpc wait/read start + 26529us + grpc wait/read end + 39us + request handler start + 15us + queue start + 20us + compute start + 266us + compute end + 4us + request handler end + 19us + grpc send start + 77us + grpc send end +... +``` + +The script can also show the data flow of the first request if there are +`TENSORS` traces in the file. If the `TENSORS` traces are from an ensemble, +the data flow will be shown with the dependency of each model. + +``` +... +Data Flow: + ========================================================== + Name: ensemble + Version:1 + QUEUE_INPUT: + input: [[0.705676 0.830855 0.833153]] + BACKEND_OUTPUT: + output: [[1. 2. 7. 0. 4. 7. 9. 3. 4. 9.]] + ========================================================== + ================================================== + Name: test_trt1 + Version:1 + QUEUE_INPUT: + input: [[0.705676 0.830855 0.833153]] + BACKEND_OUTPUT: + output1: [[1. 1. ...]] + ================================================== + ================================================== + Name: test_trt2 + Version:1 + QUEUE_INPUT: + input: [[0.705676 0.830855 0.833153]] + BACKEND_OUTPUT: + output2: [[2. 2. ...]] + ================================================== + ================================================== + Name: test_py + Version:1 + QUEUE_INPUT: + output1: [[1. 1. ...]] + QUEUE_INPUT: + output2: [[2. 2. ...]] + BACKEND_OUTPUT: + output: [[1. 2. 7. 0. 4. 7. 9. 3. 4. 9.]] + ================================================== +... +``` + +The meaning of the trace timestamps is: + +* GRPC Request Wait/Read: Collected only for inference requests that use the + GRPC protocol. The time spent waiting for a request to arrive at the + server and for that request to be read. Because wait time is + included in the time it is not a useful measure of how much time is + spent reading a request from the network. Tracing an HTTP request + will provide an accurate measure of the read time. 
+ +* HTTP Request Receive: Collected only for inference requests that use the + HTTP protocol. The time required to read the inference request from + the network. + +* Send: The time required to send the inference response. + +* Overhead: Additional time required in the HTTP or GRPC endpoint to + process the inference request and response. + +* Handler: The total time spent handling the inference request, not + including the HTTP and GRPC request/response handling. + + * Queue: The time the inference request spent in the scheduling queue. + + * Compute: The time the inference request spent executing the actual + inference. This time includes the time spent copying input and + output tensors. If --trace-level=TIMESTAMPS then a breakdown of the + compute time will be provided as follows: + + * Input: The time to copy input tensor data as required by the + inference framework / backend. This includes the time to copy + input tensor data to the GPU. + + * Infer: The time spent executing the model to perform the + inference. + + * Output: The time to copy output tensor data as required by the + inference framework / backend. This includes the time to copy + output tensor data from the GPU. + + * Overhead: Additional time required for request handling not + covered by Queue or Compute times. + +* Data Flow: The data flow of the first request. It contains the input and + output tensors of each part of execution. + + * Name: The name of model. + + * Version: The version of model. + + * QUEUE_INPUT: The tensor entering the queue of a backend to wait for + scheduling. + + * BACKEND_OUTPUT: The tensor in the response of a backend. + +## Tracing for BLS models + +Triton does not collect traces for child models invoked from +[BLS](https://github.com/triton-inference-server/python_backend/tree/main#business-logic-scripting) +models by default. + +To include child models into collected traces, user needs to provide the `trace` +argument (as shown in the example below), when constructing an InferenceRequest object. +This helps Triton associate the child model with the parent model's trace (`request.trace()`). + +```python + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + ... + def execute(self, requests): + ... + for request in requests: + ... + inference_request = pb_utils.InferenceRequest( + model_name='model_name', + requested_output_names=['REQUESTED_OUTPUT_1', 'REQUESTED_OUTPUT_2'], + inputs=[], trace = request.trace()) + +``` + +## OpenTelemetry trace support + +Triton provides an option to generate and export traces using +[OpenTelemetry APIs and SDKs](https://opentelemetry.io/). + +To specify OpenTelemetry mode for tracing, specify the `--trace-config` +flag as follows: + +``` +$ tritonserver --trace-config mode=opentelemetry \ + --trace-config opentelemetry,url= ... +``` +### Differences in trace contents from Triton's trace [output](#json-trace-output) + +OpenTelemetry APIs produce [spans](https://opentelemetry.io/docs/concepts/observability-primer/#spans) +that collect the same timestamps as Triton's Trace +APIs. Each span also includes `model_name`, `model_version`, `request_id`, +and `parent_id` as an [attribute](https://opentelemetry.io/docs/concepts/observability-primer/#span-attributes). + +The span collects `TIMESTAMPS` that consist of a name and a timestamp +in nanoseconds, which is similar to Triton Trace APIs. However, +OpenTelemetry relies on the system's clock for event timestamps, which is based +on the system's real-time clock. 
On the other hand, Triton Trace APIs +report timestamps using steady clock, which is a monotonic clock that ensures +time always movess forward. This clock is not related to wall clock time +and, for example, can measure time since last reboot. + + +### OpenTelemetry trace APIs settings + +The following table shows available OpenTelemetry trace APIs settings for +`--trace-config opentelemetry,=`. + + + + + + + + + + + + + + + + + + + + +
SettingDefault ValueDescription
urlhttp://localhost:4318/v1/traces + host:port to which the receiver is going to receive + trace data. +
resourceservice.name=triton-inference-server + Key-value pairs to be used as resource attributes.
+ Should be specified following the provided template:
+ --trace-config opentelemetry,resource=<key>=<value>
+ For example:
+ --trace-config opentelemetry,resource=service.name=triton
+ --trace-config opentelemetry,resource=service.version=1
+ Alternatively, key-value attributes can be specified through
+ + OTEL_RESOURCE_ATTRIBUTES + environment variable. +
+ + +### Limitations + +- OpenTelemetry trace mode is not supported on Windows systems. + +- Triton supports only +[OTLP/HTTP Exporter](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlphttp) +and allows specification of only url for this exporter through +`--trace-config`. Other options and corresponding default values can be +found [here](https://github.com/open-telemetry/opentelemetry-cpp/tree/v1.8.3/exporters/otlp#configuration-options--otlp-http-exporter-). + +- Triton does not support configuration of the opentelemetry trace settings +during a Triton run and opentelemetry specific settings are not available +for the retrieval through [Triton's trace extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_trace.md). \ No newline at end of file diff --git a/docs/v1_to_v2.md b/docs/user_guide/v1_to_v2.md similarity index 95% rename from docs/v1_to_v2.md rename to docs/user_guide/v1_to_v2.md index ed01313b34..d9da6f6cf8 100644 --- a/docs/v1_to_v2.md +++ b/docs/user_guide/v1_to_v2.md @@ -51,7 +51,7 @@ version 2. * The HTTP/REST and GRPC protocols, while conceptually similar to version 1, are completely changed in version 2. See [inference - protocols](inference_protocols.md) for more information. + protocols](../customization_guide/inference_protocols.md) for more information. * Python and C++ client libraries are re-implemented to match the new HTTP/REST and GRPC protocols. The Python client no longer depends on @@ -61,7 +61,7 @@ version 2. more information. * Building Triton has changed significantly in version 2. See - [build](build.md) for more information. + [build](../customization_guide/build.md) for more information. * In the Docker containers the environment variables indicating the Triton version have changed to have a TRITON prefix, for example, diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 0000000000..2843ad2d42 --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,51 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +[tool.codespell] +# note: pre-commit passes explicit lists of files here, which this skip file list doesn't override - +# this is only to allow you to run codespell interactively +skip = "./.git,./.github" +# ignore short words, and typename parameters like OffsetT +ignore-regex = "\\b(.{1,4}|[A-Z]\\w*T)\\b" +# ignore allowed words +ignore-words-list = "passin" +# use the 'clear' dictionary for unambiguous spelling mistakes +builtin = "clear" +# disable warnings about binary files and wrong encoding +quiet-level = 3 + +[tool.isort] +profile = "black" +use_parentheses = true +multi_line_output = 3 +include_trailing_comma = true +force_grid_wrap = 0 +ensure_newline_before_comments = true +line_length = 88 +balanced_wrapping = true +indent = " " +skip = ["build"] + diff --git a/qa/L0_async_work_queue/test.sh b/qa/L0_async_work_queue/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_backend_bls/test.sh b/qa/L0_backend_bls/test.sh index 505d572608..f2193ee801 100755 --- a/qa/L0_backend_bls/test.sh +++ b/qa/L0_backend_bls/test.sh @@ -37,13 +37,14 @@ source ../common/util.sh RET=0 # Backend build requires recent version of CMake (FetchContent required) -wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ - apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ - apt-get update && \ - apt-get install -y --no-install-recommends \ - cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1 \ +# Using CMAKE installation instruction from:: https://apt.kitware.com/ +apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . /etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* \ rapidjson-dev cmake --version diff --git a/qa/L0_backend_config/test.sh b/qa/L0_backend_config/test.sh old mode 100644 new mode 100755 index 3bd7890ceb..b898735798 --- a/qa/L0_backend_config/test.sh +++ b/qa/L0_backend_config/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -66,7 +66,7 @@ POSITIVE_TEST_ARGS=("--backend-config=tensorflow,default-max-batch-size=5 $COMMO "--backend-config=default-max-batch-size=7 --backend-config=tensorflow,default-max-batch-size=8 $COMMON_ARGS" \ ) -# These integers correspond to the expected default-max-batch-size which gets set +# These integers correspond to the expected default-max-batch-size which gets set # in the POSITIVE_TEST_ARGS POSITIVE_TEST_ANSWERS=(5 6 8) @@ -86,12 +86,12 @@ else RESULT_LOG_LINE=$(grep -a "Adding default backend config setting:" $SERVER_LOG) if [ "$RESULT_LOG_LINE" != "" ]; then - + # Pick out the logged value of the default-max-batch-size which gets passed into model creation RESOLVED_DEFAULT_MAX_BATCH_SIZE=$(awk -v line="$RESULT_LOG_LINE" 'BEGIN {split(line, a, "]"); split(a[2], b, ": "); split(b[2], c, ","); print c[2]}') if [ "$RESOLVED_DEFAULT_MAX_BATCH_SIZE" != "4" ]; then - echo "*** FAILED: Found default-max-batch-size not equal to the expected default-max-batch-size. Expected: default-max-batch-size,4, Found: $RESOLVED_DEFAULT_MAX_BATCH_SIZE \n" + echo "*** FAILED: Found default-max-batch-size not equal to the expected default-max-batch-size. Expected: default-max-batch-size,4, Found: $RESOLVED_DEFAULT_MAX_BATCH_SIZE \n" RET=1 fi else @@ -104,7 +104,7 @@ for ((i=0; i < ${#POSITIVE_TEST_ARGS[@]}; i++)); do SERVER_ARGS=${POSITIVE_TEST_ARGS[$i]} SERVER_LOG=$SERVER_LOG_BASE.backend_config_positive_$i.log run_server - + if [ "$SERVER_PID" == "0" ]; then echo -e "*** FAILED: Server failed to start $SERVER\n" RET=1 @@ -115,12 +115,12 @@ for ((i=0; i < ${#POSITIVE_TEST_ARGS[@]}; i++)); do RESULT_LOG_LINE=$(grep -a "Found overwritten default setting:" $SERVER_LOG) if [ "$RESULT_LOG_LINE" != "" ]; then - + # Pick out the logged value of the default-max-batch-size which gets passed into model creation RESOLVED_DEFAULT_MAX_BATCH_SIZE=$(awk -v line="$RESULT_LOG_LINE" 'BEGIN {split(line, a, "]"); split(a[2], b, ": "); split(b[2], c, ","); print c[2]}') if [ "$RESOLVED_DEFAULT_MAX_BATCH_SIZE" != "${POSITIVE_TEST_ANSWERS[$i]}" ]; then - echo "*** FAILED: Found default-max-batch-size not equal to the expected default-max-batch-size. Expected: ${POSITIVE_TEST_ANSWERS[$i]}, Found: $RESOLVED_DEFAULT_MAX_BATCH_SIZE \n" + echo "*** FAILED: Found default-max-batch-size not equal to the expected default-max-batch-size. Expected: ${POSITIVE_TEST_ANSWERS[$i]}, Found: $RESOLVED_DEFAULT_MAX_BATCH_SIZE \n" RET=1 fi else @@ -152,11 +152,11 @@ done # -# Sepcific backend tests -# +# Specific backend tests +# -# While inference server is running, save the -# config of the 'no_config' model to the TRIAL +# While inference server is running, save the +# config of the 'no_config' model to the TRIAL # file. 
function save_model_config() { CODE=`curl -s -w %{http_code} -o ./$TRIAL.out localhost:8000/v2/models/no_config/config` @@ -192,13 +192,13 @@ else RET=1 fi - # Assert we are also turning on the dynamic_batcher + # Assert we are also turning on the dynamic_batcher DYNAMIC_BATCHING_LOG_LINE=$(grep -a "Starting dynamic-batcher thread" $SERVER_LOG) if [ "$DYNAMIC_BATCHING_LOG_LINE" == "" ]; then echo "*** FAILED: Expected dynamic batching to be set in model config but was not found\n" RET=1 fi - + kill $SERVER_PID wait $SERVER_PID @@ -225,7 +225,7 @@ else RET=1 fi - # Assert batching disabled + # Assert batching disabled if [ "$(grep -a -E '\"dynamic_batching\": \{}' $SERVER_LOG)" != "" ]; then echo "*** FAILED: Found dynamic batching enabled in configuration when none expected.\n" RET=1 @@ -252,7 +252,7 @@ if [ "$SERVER_PID" == "0" ]; then else save_model_config - + # Assert the max-batch-size is the command line value MAX_BATCH_LOG_LINE=$(grep -a "\"max_batch_size\":5" $TRIAL.out) if [ "$MAX_BATCH_LOG_LINE" == "" ]; then @@ -260,13 +260,13 @@ else RET=1 fi - # Assert we are also turning on the dynamic_batcher + # Assert we are also turning on the dynamic_batcher DYNAMIC_BATCHING_LOG_LINE=$(grep -a "Starting dynamic-batcher thread" $SERVER_LOG) if [ "$DYNAMIC_BATCHING_LOG_LINE" == "" ]; then echo "*** FAILED: Expected dynamic batching to be set in model config but was not found\n" RET=1 fi - + kill $SERVER_PID wait $SERVER_PID fi @@ -296,7 +296,7 @@ else RET=1 fi - # Assert batching disabled + # Assert batching disabled if [ "$(grep -a -E '\"dynamic_batching\": \{}' $SERVER_LOG)" != "" ]; then echo "*** FAILED: Found dynamic batching in configuration when none expected.\n" RET=1 @@ -307,6 +307,97 @@ else fi +# +# General backend tests +# + +# We want to make sure that backend configurations +# are not lost. For this purpose we are using only onnx backend + +rm -rf ./models/ +mkdir -p ./models/no_config/ +cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/onnx_float32_float32_float32/1 ./models/no_config/ + +# First getting a baseline for the number of default configs +# added during a server set up +SERVER_ARGS="$COMMON_ARGS" +SERVER_LOG=$SERVER_LOG_BASE.default_configs.log +run_server + +if [ "$SERVER_PID" == "0" ]; then + echo -e "*** FAILED: Server failed to start $SERVER\n" + RET=1 + +else + # Count number of default configs + BACKEND_CONFIG_MAP=$(grep -a "backend configuration:" $SERVER_LOG -A 1 | grep -v "backend configuration") + DEFAULT_CONFIG_COUNT=$(echo $BACKEND_CONFIG_MAP | jq -r | jq '.["cmdline"]' | jq 'length') + if [ $DEFAULT_CONFIG_COUNT -lt 4 ]; then + echo "*** FAILED: Expected number of default configs to be at least 4 but found: $DEFAULT_CONFIG_COUNT\n" + RET=1 + fi + + kill $SERVER_PID + wait $SERVER_PID + +fi + +# Now make sure that when setting specific backend configs +# default ones are not lost. +# Current logic for backend config resolution reads default configs first, +# then specific configs and overrides defaults if needed. +# We would like to make sure that none of configs are lost and +# defaults are properly overridden. +# One of defaultconfigs is `min-compute-capability`. This test +# checks if it is properlly overridden. 
+MIN_COMPUTE_CAPABILITY=XX +SERVER_ARGS="--backend-config=onnxruntime,min-compute-capability=$MIN_COMPUTE_CAPABILITY $COMMON_ARGS" +SERVER_LOG=$SERVER_LOG_BASE.global_configs.log +run_server + +if [ "$SERVER_PID" == "0" ]; then + echo -e "*** FAILED: Server failed to start $SERVER\n" + RET=1 + +else + # Count number of default configs + BACKEND_CONFIG_MAP=$(grep -a "backend configuration:" $SERVER_LOG -A 1 | grep -v "backend configuration") + CONFIG_VALUE=$(echo $BACKEND_CONFIG_MAP | jq -r | jq '.["cmdline"]' | jq -r '.["min-compute-capability"]') + + if [ $CONFIG_VALUE != $MIN_COMPUTE_CAPABILITY ]; then + echo "*** FAILED: Expected min-compute-capability config to be $MIN_COMPUTE_CAPABILITY but found: $CONFIG_VALUE\n" + RET=1 + fi + + kill $SERVER_PID + wait $SERVER_PID + +fi +# Now make sure that specific backend configs are not lost. +SERVER_ARGS="--backend-config=onnxruntime,a=0 --backend-config=onnxruntime,y=0 --backend-config=onnxruntime,z=0 $COMMON_ARGS" +SERVER_LOG=$SERVER_LOG_BASE.specific_configs.log +EXPECTED_CONFIG_COUNT=$(($DEFAULT_CONFIG_COUNT+3)) +run_server + +if [ "$SERVER_PID" == "0" ]; then + echo -e "*** FAILED: Server failed to start $SERVER\n" + RET=1 + +else + # Count number of default configs + BACKEND_CONFIG_MAP=$(grep -a "backend configuration:" $SERVER_LOG -A 1 | grep -v "backend configuration") + TOTAL_CONFIG_COUNT=$(echo $BACKEND_CONFIG_MAP | jq -r | jq '.["cmdline"]' | jq 'length') + + if [ $TOTAL_CONFIG_COUNT -ne $EXPECTED_CONFIG_COUNT ]; then + echo "*** FAILED: Expected number of backend configs to be $EXPECTED_CONFIG_COUNT but found: $TOTAL_CONFIG_COUNT\n" + RET=1 + fi + + kill $SERVER_PID + wait $SERVER_PID + +fi + # Print test outcome if [ $RET -eq 0 ]; then diff --git a/qa/L0_backend_fastertransformer/test.sh b/qa/L0_backend_fastertransformer/test.sh new file mode 100755 index 0000000000..8e5d20271a --- /dev/null +++ b/qa/L0_backend_fastertransformer/test.sh @@ -0,0 +1,83 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+FASTERTRANSFORMER_BRANCH_TAG=${FASTERTRANSFORMER_BRANCH_TAG:="main"} +FASTERTRANSFORMER_BRANCH=${FASTERTRANSFORMER_BRANCH:="https://github.com/triton-inference-server/fastertransformer_backend.git"} +SERVER_TIMEOUT=600 +SERVER_LOG="$PWD/inference_server" +CLIENT_LOG="$PWD/client" + +MODEL_DIR=${MODEL_DIR:=$PWD/fastertransformer_backend/all_models/t5/} +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends +SERVER_ARGS_EXTRA="--exit-timeout-secs=${SERVER_TIMEOUT} --backend-directory=${BACKEND_DIR}" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${SERVER_ARGS_EXTRA}" +source ../common/util.sh + +rm -f $SERVER_LOG* $CLIENT_LOG* + +RET=0 +# install dependencies +apt-get update && \ + apt-get install -y --no-install-recommends python3 python3-pip python3-protobuf +python3 -m pip install --upgrade pip && \ + pip3 install --upgrade numpy + +# install client libraries +pip3 install tritonclient[all] + +# Clone repo +git clone --single-branch --depth=1 -b ${FASTERTRANSFORMER_BRANCH_TAG} ${FASTERTRANSFORMER_BRANCH} +cd fastertransformer_backend + +run_server + +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e + +python3 tools/issue_request.py tools/requests/sample_request_single_t5.json >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + RET=1 +fi + +kill_server + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $SERVER_LOG + cat $CLIENT_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_backend_identity/identity_test.py b/qa/L0_backend_identity/identity_test.py old mode 100644 new mode 100755 index 009576aa34..ef0634b95c --- a/qa/L0_backend_identity/identity_test.py +++ b/qa/L0_backend_identity/identity_test.py @@ -27,74 +27,45 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np import sys -import requests as httpreq from builtins import range + +import numpy as np +import requests as httpreq import tritongrpcclient as grpcclient import tritonhttpclient as httpclient from tritonclientutils import np_to_triton_dtype FLAGS = None -def test_bf16_raw_http(shape): - model = "identity_bf16" - # Using fp16 data as a WAR since it is same byte_size as bf16 - # and is supported by numpy for ease-of-use. 
Since this is an - # identity model, it's OK that the bytes are interpreted differently - input_data = (16384 * np.random.randn(*shape)).astype(np.float16) - input_bytes = input_data.tobytes() - headers = {'Inference-Header-Content-Length': '0'} - r = httpreq.post("http://localhost:8000/v2/models/{}/infer".format(model), - data=input_bytes, - headers=headers) - r.raise_for_status() - - # Get the inference header size so we can locate the output binary data - header_size = int(r.headers["Inference-Header-Content-Length"]) - output_bytes = r.content[header_size:] - # Sanity check output on pass - print("Response content:", r.content) - print("Input Bytes:", input_bytes) - print("Output Bytes:", output_bytes) - - # Assert correct output datatype - import json - response_json = json.loads(r.content[:header_size].decode("utf-8")) - assert(response_json["outputs"][0]["datatype"] == "BF16") - - # Assert equality of input/output for identity model - if not np.array_equal(output_bytes, input_bytes): - print("error: Expected response body contains correct output binary " \ - "data: {}; got: {}".format(input_bytes, output_bytes)) - sys.exit(1) - -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-u', - '--url', - type=str, - required=False, - help='Inference server URL.') parser.add_argument( - '-i', - '--protocol', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", "--url", type=str, required=False, help="Inference server URL." + ) + parser.add_argument( + "-i", + "--protocol", type=str, required=False, - default='http', - help='Protocol ("http"/"grpc") used to ' + - 'communicate with inference service. Default is "http".') + default="http", + help='Protocol ("http"/"grpc") used to ' + + 'communicate with inference service. 
Default is "http".', + ) FLAGS = parser.parse_args() if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"): - print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format( - FLAGS.protocol)) + print( + 'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol) + ) exit(1) client_util = httpclient if FLAGS.protocol == "http" else grpcclient @@ -109,17 +80,18 @@ def test_bf16_raw_http(shape): model_name = "identity_uint32" request_parallelism = 4 shape = [2, 2] - with client_util.InferenceServerClient(FLAGS.url, - concurrency=request_parallelism, - verbose=FLAGS.verbose) as client: + with client_util.InferenceServerClient( + FLAGS.url, concurrency=request_parallelism, verbose=FLAGS.verbose + ) as client: input_datas = [] requests = [] for i in range(request_parallelism): input_data = (16384 * np.random.randn(*shape)).astype(np.uint32) input_datas.append(input_data) inputs = [ - client_util.InferInput("INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + client_util.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) requests.append(client.async_infer(model_name, inputs)) @@ -136,32 +108,44 @@ def test_bf16_raw_http(shape): sys.exit(1) if not np.array_equal(output_data, input_datas[i]): - print("error: expected output {} to match input {}".format( - output_data, input_datas[i])) + print( + "error: expected output {} to match input {}".format( + output_data, input_datas[i] + ) + ) sys.exit(1) # Make sure the requests ran in parallel. stats = client.get_inference_statistics(model_name) - if (len(stats['model_stats']) != - 1) or (stats['model_stats'][0]['name'] != model_name): + if (len(stats["model_stats"]) != 1) or ( + stats["model_stats"][0]["name"] != model_name + ): print("error: expected statistics for {}".format(model_name)) sys.exit(1) - stat = stats['model_stats'][0] - if (stat['inference_count'] != 8) or (stat['execution_count'] != 1): + stat = stats["model_stats"][0] + if (stat["inference_count"] != 8) or (stat["execution_count"] != 1): print( - "error: expected execution_count == 1 and inference_count == 8, got {} and {}" - .format(stat['execution_count'], stat['inference_count'])) + "error: expected execution_count == 1 and inference_count == 8, got {} and {}".format( + stat["execution_count"], stat["inference_count"] + ) + ) sys.exit(1) # Check metrics to make sure they are reported correctly - metrics = httpreq.get('http://localhost:8002/metrics') + metrics = httpreq.get("http://localhost:8002/metrics") print(metrics.text) - success_str = 'nv_inference_request_success{model="identity_uint32",version="1"}' + success_str = ( + 'nv_inference_request_success{model="identity_uint32",version="1"}' + ) infer_count_str = 'nv_inference_count{model="identity_uint32",version="1"}' - infer_exec_str = 'nv_inference_exec_count{model="identity_uint32",version="1"}' - custom_metric_str = 'input_byte_size_counter{model="identity_uint32",version="1"}' + infer_exec_str = ( + 'nv_inference_exec_count{model="identity_uint32",version="1"}' + ) + custom_metric_str = ( + 'input_byte_size_counter{model="identity_uint32",version="1"}' + ) success_val = None infer_count_val = None @@ -169,55 +153,69 @@ def test_bf16_raw_http(shape): custom_metric_val = None for line in metrics.text.splitlines(): if line.startswith(success_str): - success_val = float(line[len(success_str):]) + success_val = float(line[len(success_str) :]) if line.startswith(infer_count_str): - infer_count_val = 
float(line[len(infer_count_str):]) + infer_count_val = float(line[len(infer_count_str) :]) if line.startswith(infer_exec_str): - infer_exec_val = float(line[len(infer_exec_str):]) + infer_exec_val = float(line[len(infer_exec_str) :]) if line.startswith(custom_metric_str): - custom_metric_val = float(line[len(custom_metric_str):]) + custom_metric_val = float(line[len(custom_metric_str) :]) if success_val != 4: - print("error: expected metric {} == 4, got {}".format( - success_str, success_val)) + print( + "error: expected metric {} == 4, got {}".format( + success_str, success_val + ) + ) sys.exit(1) if infer_count_val != 8: - print("error: expected metric {} == 8, got {}".format( - infer_count_str, infer_count_val)) + print( + "error: expected metric {} == 8, got {}".format( + infer_count_str, infer_count_val + ) + ) sys.exit(1) if infer_exec_val != 1: - print("error: expected metric {} == 1, got {}".format( - infer_exec_str, infer_exec_val)) + print( + "error: expected metric {} == 1, got {}".format( + infer_exec_str, infer_exec_val + ) + ) sys.exit(1) if custom_metric_val != 64: - print("error: expected metric {} == 64, got {}".format( - custom_metric_str, custom_metric_val)) + print( + "error: expected metric {} == 64, got {}".format( + custom_metric_str, custom_metric_val + ) + ) sys.exit(1) # Reuse a single client for all sync tests - with client_util.InferenceServerClient(FLAGS.url, - verbose=FLAGS.verbose) as client: + with client_util.InferenceServerClient(FLAGS.url, verbose=FLAGS.verbose) as client: for model_name, np_dtype, shape in ( - # yapf: disable + # yapf: disable ("identity_fp32", np.float32, [1, 0]), ("identity_fp32", np.float32, [1, 5]), ("identity_uint32", np.uint32, [4, 0]), ("identity_uint32", np.uint32, [8, 5]), ("identity_nobatch_int8", np.int8, [0]), ("identity_nobatch_int8", np.int8, [7]), - ("identity_bytes", object, [1, 1])): + ("identity_bytes", object, [1, 1]), + ("identity_bf16", np.float32, [1, 0]), + ("identity_bf16", np.float32, [1, 5]) + ): # yapf: enable if np_dtype != object: input_data = (16384 * np.random.randn(*shape)).astype(np_dtype) else: - in0 = (16384 * np.ones(shape, dtype='int')) - in0n = np.array([str(x) for x in in0.reshape(in0.size)], - dtype=object) + in0 = 16384 * np.ones(shape, dtype="int") + in0n = np.array([str(x) for x in in0.reshape(in0.size)], dtype=object) input_data = in0n.reshape(in0.shape) - inputs = [ - client_util.InferInput("INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) - ] + if model_name != "identity_bf16": + triton_type = np_to_triton_dtype(input_data.dtype) + else: + triton_type = "BF16" + inputs = [client_util.InferInput("INPUT0", input_data.shape, triton_type)] inputs[0].set_data_from_numpy(input_data) results = client.infer(model_name, inputs) @@ -228,17 +226,48 @@ def test_bf16_raw_http(shape): if np_dtype == object: output_data = np.array( - [str(x, encoding='utf-8') for x in output_data.flatten()], - dtype=object).reshape(output_data.shape) + [str(x, encoding="utf-8") for x in output_data.flatten()], + dtype=object, + ).reshape(output_data.shape) if output_data is None: print("error: expected 'OUTPUT0'") sys.exit(1) - if not np.array_equal(output_data, input_data): - print("error: expected output {} to match input {}".format( - output_data, input_data)) - sys.exit(1) + if model_name == "identity_bf16": + if input_data.shape != output_data.shape: + print( + "error: expected output shape {} to match input shape {}".format( + output_data.shape, input_data.shape + ) + ) + sys.exit(1) + for input, 
output in zip( + np.nditer(input_data, flags=["refs_ok", "zerosize_ok"], order="C"), + np.nditer(output_data, flags=["refs_ok", "zerosize_ok"], order="C"), + ): + if input.tobytes()[2:4] != output.tobytes()[2:4]: + print( + "error: expected low-order bits of output {} to match low-order bits of input {}".format( + output, input + ) + ) + sys.exit(1) + if output.tobytes()[0:2] != b"\x00\x00": + print( + "error: expected output {} to have all-zero high-order bits, got {}".format( + output, output.tobytes()[0:2] + ) + ) + sys.exit(1) + else: + if not np.array_equal(output_data, input_data): + print( + "error: expected output {} to match input {}".format( + output_data, input_data + ) + ) + sys.exit(1) # Make sure response parameters are correct response = results.get_response() @@ -254,8 +283,7 @@ def test_bf16_raw_http(shape): param2 = params["param2"].bool_param if param0 != "an example string parameter": - print( - "error: expected 'param0' == 'an example string parameter'") + print("error: expected 'param0' == 'an example string parameter'") sys.exit(1) if param1 != 42: print("error: expected 'param1' == 42") @@ -263,8 +291,3 @@ def test_bf16_raw_http(shape): if param2 != False: print("error: expected 'param2' == False") sys.exit(1) - - # FIXME: Use identity_bf16 model in test above once proper python client - # support is added, and remove this raw HTTP test. See DLIS-3720. - test_bf16_raw_http([2, 2]) - diff --git a/qa/L0_backend_identity/test.sh b/qa/L0_backend_identity/test.sh index d49686493c..bd29951ba6 100755 --- a/qa/L0_backend_identity/test.sh +++ b/qa/L0_backend_identity/test.sh @@ -82,7 +82,7 @@ wait $SERVER_PID # Validate the byte_sizes reported by backend OLDIFS=$IFS; IFS=',' -for i in "byte_size = 0, 6", \ +for i in "byte_size = 0, 8", \ "byte_size = 7, 2", \ "byte_size = 16, 6", \ "byte_size = 20, 2", \ diff --git a/qa/L0_backend_output_detail/test.sh b/qa/L0_backend_output_detail/test.sh new file mode 100755 index 0000000000..a8f4de59d1 --- /dev/null +++ b/qa/L0_backend_output_detail/test.sh @@ -0,0 +1,69 @@ +#!/bin/bash +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "No Repo version detected" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi +export CUDA_VISIBLE_DEVICES=0 + +rm -f *.log +MODELSDIR=`pwd`/models +rm -fr $MODELSDIR && mkdir -p $MODELSDIR/add_sub/1 && \ + cp ../python_models/add_sub/config.pbtxt $MODELSDIR/add_sub && \ + cp ../python_models/add_sub/model.py $MODELSDIR/add_sub/1 && \ + +source ../common/util.sh + +RET=0 + +TEST_LOG="./backend_output_detail_test.log" +TEST_EXEC=./backend_output_detail_test + +set +e +LD_LIBRARY_PATH=/opt/tritonserver/lib:$LD_LIBRARY_PATH $TEST_EXEC >>$TEST_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Backend Output Detail Unit Test Failed\n***" + RET=1 +fi +set -e + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $TEST_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/argument_validation/models/argument_validation/1/model.py b/qa/L0_backend_python/argument_validation/models/argument_validation/1/model.py index 5af497aa0b..df1b298a35 100644 --- a/qa/L0_backend_python/argument_validation/models/argument_validation/1/model.py +++ b/qa/L0_backend_python/argument_validation/models/argument_validation/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,18 +24,18 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import unittest + +import numpy as np import triton_python_backend_utils as pb_utils class ArgumentValidationTest(unittest.TestCase): - def test_infer_request_args(self): # Dummy arguments used in the tests. 
- inputs = [pb_utils.Tensor('INPUT0', np.asarray([1, 2], dtype=np.int32))] - model_name = 'my_model' - requested_output_names = ['my_output'] + inputs = [pb_utils.Tensor("INPUT0", np.asarray([1, 2], dtype=np.int32))] + model_name = "my_model" + requested_output_names = ["my_output"] # # inputs field validation @@ -46,21 +46,24 @@ def test_infer_request_args(self): pb_utils.InferenceRequest( inputs=[None], model_name=model_name, - requested_output_names=requested_output_names) + requested_output_names=requested_output_names, + ) # Test None object as list of inputs with self.assertRaises(TypeError) as e: pb_utils.InferenceRequest( inputs=None, model_name=model_name, - requested_output_names=requested_output_names) + requested_output_names=requested_output_names, + ) # model_name validation with self.assertRaises(TypeError) as e: pb_utils.InferenceRequest( model_name=None, inputs=inputs, - requested_output_names=requested_output_names) + requested_output_names=requested_output_names, + ) # # Requested output name validations @@ -68,14 +71,14 @@ def test_infer_request_args(self): # Test list of None objects as requested_output_names with self.assertRaises(TypeError) as e: - pb_utils.InferenceRequest(requested_output_names=[None], - inputs=inputs, - model_name=model_name) + pb_utils.InferenceRequest( + requested_output_names=[None], inputs=inputs, model_name=model_name + ) with self.assertRaises(TypeError) as e: - pb_utils.InferenceRequest(requested_output_names=None, - inputs=inputs, - model_name=model_name) + pb_utils.InferenceRequest( + requested_output_names=None, inputs=inputs, model_name=model_name + ) # Other arguments validation @@ -85,7 +88,8 @@ def test_infer_request_args(self): requested_output_names=requested_output_names, inputs=inputs, model_name=model_name, - correleation_id=None) + correleation_id=None, + ) # request_id set to None with self.assertRaises(TypeError) as e: @@ -93,7 +97,8 @@ def test_infer_request_args(self): requested_output_names=requested_output_names, inputs=inputs, model_name=model_name, - request_id=None) + request_id=None, + ) # model_version set to None with self.assertRaises(TypeError) as e: @@ -101,7 +106,8 @@ def test_infer_request_args(self): requested_output_names=requested_output_names, inputs=inputs, model_name=model_name, - model_version=None) + model_version=None, + ) # flags set to None with self.assertRaises(TypeError) as e: @@ -109,17 +115,16 @@ def test_infer_request_args(self): requested_output_names=requested_output_names, inputs=inputs, model_name=model_name, - flags=None) + flags=None, + ) # Empty lists should not raise an exception - pb_utils.InferenceRequest(requested_output_names=[], - inputs=[], - model_name=model_name) + pb_utils.InferenceRequest( + requested_output_names=[], inputs=[], model_name=model_name + ) def test_infer_response_args(self): - outputs = [ - pb_utils.Tensor('OUTPUT0', np.asarray([1, 2], dtype=np.int32)) - ] + outputs = [pb_utils.Tensor("OUTPUT0", np.asarray([1, 2], dtype=np.int32))] # Test list of None object as output tensor with self.assertRaises(pb_utils.TritonModelException) as e: @@ -145,17 +150,47 @@ def test_tensor_args(self): pb_utils.Tensor("OUTPUT0", None) # Test None as dlpack capsule - with self.assertRaises(TypeError) as e: + with self.assertRaises(pb_utils.TritonModelException) as e: pb_utils.Tensor.from_dlpack("OUTPUT0", None) - # Test empty string as model name (from_dlpack) - with self.assertRaises(TypeError) as e: + # Test empty string as tensor name (from_dlpack) + with 
self.assertRaises(pb_utils.TritonModelException) as e: pb_utils.Tensor.from_dlpack("", None) - # Test empty string as model name + # Test empty string as tensor name with self.assertRaises(TypeError) as e: pb_utils.Tensor("", None) + def test_log_args(self): + logger = pb_utils.Logger + + # Test None as log level setting + with self.assertRaises(TypeError) as e: + logger.log("Invalid Level", None) + + # Test integer as log level setting + with self.assertRaises(TypeError) as e: + logger.log("Invalid Level", 1) + + # Test None as log info msg + with self.assertRaises(TypeError) as e: + logger.log_info(None) + + # Test None as log warning msg + with self.assertRaises(TypeError) as e: + logger.log_warn(None) + + # Test None as log error msg + with self.assertRaises(TypeError) as e: + logger.log_error(None) + + # Test None as log verbose msg + with self.assertRaises(TypeError) as e: + logger.log_verbose(None) + + # This should not raise an exception + logger.log("Level unspecified") + class TritonPythonModel: """This model tests the Python API arguments to make sure invalid args are @@ -165,12 +200,15 @@ def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. - test = unittest.main('model', exit=False) + test = unittest.main("model", exit=False) responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor( - 'OUTPUT0', - np.array([test.result.wasSuccessful()], - dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) return responses diff --git a/qa/L0_backend_python/argument_validation/test.sh b/qa/L0_backend_python/argument_validation/test.sh old mode 100644 new mode 100755 index f80ce3e84b..b7f6e96293 --- a/qa/L0_backend_python/argument_validation/test.sh +++ b/qa/L0_backend_python/argument_validation/test.sh @@ -1,4 +1,5 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,14 +26,14 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. CLIENT_PY=../python_unittest.py -CLIENT_LOG="./client.log" +CLIENT_LOG="./arg_validation_client.log" EXPECTED_NUM_TESTS="1" TEST_RESULT_FILE='test_results.txt' TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./arg_validation_server.log" RET=0 source ../../common/util.sh diff --git a/qa/L0_backend_python/bls/bls_parameters_test.py b/qa/L0_backend_python/bls/bls_parameters_test.py new file mode 100755 index 0000000000..e08ab2b96f --- /dev/null +++ b/qa/L0_backend_python/bls/bls_parameters_test.py @@ -0,0 +1,71 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json +import unittest + +import numpy as np +import tritonclient.grpc as grpcclient +from tritonclient.utils import np_to_triton_dtype + + +class TestBlsParameters(unittest.TestCase): + def test_bls_parameters(self): + model_name = "bls_parameters" + shape = [1] + num_params = 3 + + # Based on the num_params specified, the model will generate a JSON response + # containing all the supported parameter types for num_params times recursively. + # Make sure the model has at least num_params + 1 instances. + expected_params = {} + for i in range(1, num_params + 1): + expected_params["bool_" + str(i)] = bool(i) + expected_params["int_" + str(i)] = i + expected_params["str_" + str(i)] = str(i) + + with grpcclient.InferenceServerClient("localhost:8001") as client: + input_data = np.array([num_params], dtype=np.ubyte) + inputs = [ + grpcclient.InferInput( + "NUMBER_PARAMETERS", shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + outputs = [grpcclient.InferRequestedOutput("PARAMETERS_AGGREGATED")] + result = client.infer(model_name, inputs, outputs=outputs) + params_json = str( + result.as_numpy("PARAMETERS_AGGREGATED")[0], encoding="utf-8" + ) + + params = json.loads(params_json) + self.assertEqual(params, expected_params) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_backend_python/bls/test.sh b/qa/L0_backend_python/bls/test.sh old mode 100644 new mode 100755 index 62a98dd228..95abc84e06 --- a/qa/L0_backend_python/bls/test.sh +++ b/qa/L0_backend_python/bls/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,7 +26,7 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
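[Editor's note on bls_parameters_test.py added above: purely as an illustration derived from the loop in that test (num_params == 3), the aggregate it asserts against works out to the following; no new names are introduced here.]

    expected_params = {
        "bool_1": True, "int_1": 1, "str_1": "1",
        "bool_2": True, "int_2": 2, "str_2": "2",
        "bool_3": True, "int_3": 3, "str_3": "3",
    }

[The bls_parameters model is expected to return this mapping JSON-encoded in the PARAMETERS_AGGREGATED output, which the test decodes with json.loads before comparing.]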
CLIENT_PY=../python_unittest.py -CLIENT_LOG="./client.log" +CLIENT_LOG="./bls_client.log" EXPECTED_NUM_TESTS="1" TEST_RESULT_FILE='test_results.txt' source ../../common/util.sh @@ -34,14 +34,14 @@ source ../../common/util.sh TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends -SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" RET=0 -rm -fr *.log ./models +# This variable is used to print out the correct server log for each sub-test. +SUB_TEST_RET=0 +rm -fr *.log ./models *.txt pip3 uninstall -y torch -pip3 install torch==1.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html +pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html mkdir -p models/bls/1/ cp ../../python_models/bls/model.py models/bls/1/ @@ -81,6 +81,178 @@ cp ../../python_models/dlpack_identity/config.pbtxt models/dlpack_identity cp -r ${DATADIR}/qa_sequence_implicit_model_repository/onnx_nobatch_sequence_int32/ ./models +git clone https://github.com/triton-inference-server/python_backend -b $PYTHON_BACKEND_REPO_TAG +mkdir -p models/square_int32/1/ +cp python_backend/examples/decoupled/square_model.py models/square_int32/1/model.py +cp python_backend/examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt + +mkdir -p models/dlpack_square/1/ +cp ../../python_models/dlpack_square/model.py models/dlpack_square/1/ +cp ../../python_models/dlpack_square/config.pbtxt models/dlpack_square + +mkdir -p models/identity_fp32_timeout/1/ +cp ../../python_models/identity_fp32_timeout/model.py models/identity_fp32_timeout/1/ +cp ../../python_models/identity_fp32_timeout/config.pbtxt models/identity_fp32_timeout + +cp -r ${DATADIR}/qa_model_repository/libtorch_nobatch_float32_float32_float32/ ./models/libtorch_gpu && \ + sed -i 's/libtorch_nobatch_float32_float32_float32/libtorch_gpu/' models/libtorch_gpu/config.pbtxt && \ + echo "instance_group [ { kind: KIND_GPU} ]" >> models/libtorch_gpu/config.pbtxt + +cp -r ${DATADIR}/qa_model_repository/libtorch_nobatch_float32_float32_float32/ ./models/libtorch_cpu && \ + sed -i 's/libtorch_nobatch_float32_float32_float32/libtorch_cpu/' models/libtorch_cpu/config.pbtxt && \ + echo "instance_group [ { kind: KIND_CPU} ]" >> models/libtorch_cpu/config.pbtxt + +# Test with different sizes of CUDA memory pool +for CUDA_MEMORY_POOL_SIZE_MB in 64 128 ; do + CUDA_MEMORY_POOL_SIZE_BYTES=$((CUDA_MEMORY_POOL_SIZE_MB * 1024 * 1024)) + SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1 --cuda-memory-pool-byte-size=0:${CUDA_MEMORY_POOL_SIZE_BYTES}" + for TRIAL in non_decoupled decoupled ; do + export BLS_KIND=$TRIAL + SERVER_LOG="./bls_$TRIAL.$CUDA_MEMORY_POOL_SIZE_MB.inference_server.log" + + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + + export MODEL_NAME='bls' + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'bls' $BLS_KIND test FAILED. \n***" + cat $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + SUB_TEST_RET=1 + fi + fi + + export MODEL_NAME='bls_memory' + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 + if [ $? 
-ne 0 ]; then + echo -e "\n***\n*** 'bls_memory' $BLS_KIND test FAILED. \n***" + cat $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + SUB_TEST_RET=1 + fi + fi + + export MODEL_NAME='bls_memory_async' + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'bls_async_memory' $BLS_KIND test FAILED. \n***" + cat $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + SUB_TEST_RET=1 + fi + fi + + export MODEL_NAME='bls_async' + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'bls_async' $BLS_KIND test FAILED. \n***" + cat $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + SUB_TEST_RET=1 + fi + fi + + set -e + + kill $SERVER_PID + wait $SERVER_PID + + if [ $SUB_TEST_RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG + fi + + # Check for bls 'test_timeout' to ensure timeout value is being correctly passed + if [ `grep -c "Request timeout: 11000000000" $SERVER_LOG` == "0" ]; then + echo -e "\n***\n*** BLS timeout value not correctly passed to model: line ${LINENO}\n***" + cat $SERVER_LOG + RET=1 + fi + + if [[ $CUDA_MEMORY_POOL_SIZE_MB -eq 128 ]]; then + if [ `grep -c "Failed to allocate memory from CUDA memory pool" $SERVER_LOG` != "0" ]; then + echo -e "\n***\n*** Expected to use CUDA memory pool for all tests when CUDA_MEMOY_POOL_SIZE_MB is 128 MB for 'bls' $BLS_KIND test\n***" + cat $SERVER_LOG + RET=1 + fi + fi + done +done + +# Test error handling when BLS is used in "initialize" or "finalize" function +ERROR_MESSAGE="BLS is only supported during the 'execute' function." + +rm -fr ./models +mkdir -p models/bls_init_error/1/ +cp ../../python_models/bls_init_error/model.py models/bls_init_error/1/ +cp ../../python_models/bls_init_error/config.pbtxt models/bls_init_error +SERVER_LOG="./bls_init_error_server.log" +SUB_TEST_RET=0 + +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "*** FAILED: unexpected success starting $SERVER" >> $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + kill $SERVER_PID + wait $SERVER_PID +else + if grep "$ERROR_MESSAGE" $SERVER_LOG; then + echo -e "Found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + else + echo -e "Not found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 + fi +fi + +if [ $SUB_TEST_RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG +fi + +rm -fr ./models +mkdir -p models/bls_finalize_error/1/ +cp ../../python_models/bls_finalize_error/model.py models/bls_finalize_error/1/ +cp ../../python_models/bls_finalize_error/config.pbtxt models/bls_finalize_error/ +SERVER_LOG="./bls_finalize_error_server.log" +SUB_TEST_RET=0 + run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -88,66 +260,152 @@ if [ "$SERVER_PID" == "0" ]; then exit 1 fi -set +e +kill $SERVER_PID +wait $SERVER_PID -export MODEL_NAME='bls' -python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** 'bls' test FAILED. 
\n***" +if grep "$ERROR_MESSAGE" $SERVER_LOG; then + echo -e "Found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG +else + echo -e "Not found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + RET=1 + SUB_TEST_RET=1 +fi + +if [ $SUB_TEST_RET -eq 1 ]; then cat $CLIENT_LOG + cat $SERVER_LOG +fi + +# Test model loading API with BLS +SUB_TEST_RET=0 +rm -fr ./models +mkdir -p models/bls_model_loading/1/ +cp ../../python_models/bls_model_loading/model.py models/bls_model_loading/1/ +cp ../../python_models/bls_model_loading/config.pbtxt models/bls_model_loading/ +cp -fr ${DATADIR}/qa_model_repository/onnx_int32_int32_int32 models/. +# Make only version 2, 3 is valid version directory +rm -rf models/onnx_int32_int32_int32/1 + +SERVER_LOG="./bls_model_loading_server.log" +SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --model-control-mode=explicit --log-verbose=1" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +export MODEL_NAME='bls_model_loading' + +set +e +code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${MODEL_NAME}/load` +set -e +if [ "$code" == "400" ]; then + echo -e "\n***\n*** Failed to load model '${MODEL_NAME}'\n***" RET=1 -else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Result Verification Failed\n***" - RET=1 - fi + SUB_TEST_RET=1 fi -export MODEL_NAME='bls_memory' -python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 +set +e + +python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then - echo -e "\n***\n*** 'bls_memory' test FAILED. \n***" + echo -e "\n***\n*** 'bls_model_loading' test FAILED. \n***" cat $CLIENT_LOG RET=1 + SUB_TEST_RET=1 else check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Result Verification Failed\n***" RET=1 + SUB_TEST_RET=1 fi fi -export MODEL_NAME='bls_memory_async' -python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** 'bls_async_memory' test FAILED. 
\n***" +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $SUB_TEST_RET -eq 1 ]; then cat $CLIENT_LOG + cat $SERVER_LOG +fi + +# Test model loading API with BLS warmup +(cd models/bls_model_loading && \ + echo "model_warmup [{" >> config.pbtxt && \ + echo " name : \"regular sample\"" >> config.pbtxt && \ + echo " batch_size: 1" >> config.pbtxt && \ + echo " inputs {" >> config.pbtxt && \ + echo " key: \"INPUT0\"" >> config.pbtxt && \ + echo " value: {" >> config.pbtxt && \ + echo " data_type: TYPE_FP32" >> config.pbtxt && \ + echo " dims: 4" >> config.pbtxt && \ + echo " zero_data: false" >> config.pbtxt && \ + echo " }" >> config.pbtxt && \ + echo " }" >> config.pbtxt && \ + echo " inputs {" >> config.pbtxt && \ + echo " key: \"INPUT1\"" >> config.pbtxt && \ + echo " value: {" >> config.pbtxt && \ + echo " data_type: TYPE_FP32" >> config.pbtxt && \ + echo " dims: 4" >> config.pbtxt && \ + echo " zero_data: false" >> config.pbtxt && \ + echo " }" >> config.pbtxt && \ + echo " }" >> config.pbtxt && \ + echo "}]" >> config.pbtxt ) + +SUB_TEST_RET=0 +SERVER_LOG="./bls_model_loading_server_warmup.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${MODEL_NAME}/load` +set -e +if [ "$code" == "400" ]; then + echo -e "\n***\n*** Failed to load model '${MODEL_NAME}'\n***" RET=1 -else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Result Verification Failed\n***" - RET=1 - fi + SUB_TEST_RET=1 fi -export MODEL_NAME='bls_async' -python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** 'bls_async' test FAILED. \n***" +kill $SERVER_PID +wait $SERVER_PID + +if [ $SUB_TEST_RET -eq 1 ]; then cat $CLIENT_LOG + cat $SERVER_LOG +fi + +# Test BLS parameters +rm -rf params_models && mkdir -p params_models/bls_parameters/1 +cp ../../python_models/bls_parameters/model.py ./params_models/bls_parameters/1 +cp ../../python_models/bls_parameters/config.pbtxt ./params_models/bls_parameters + +TEST_LOG="./bls_parameters.log" +SERVER_LOG="./bls_parameters.server.log" + +SERVER_ARGS="--model-repository=`pwd`/params_models --backend-directory=${BACKEND_DIR} --log-verbose=1" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python3 bls_parameters_test.py > $TEST_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** bls_parameters_test.py FAILED. \n***" + cat $TEST_LOG RET=1 -else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Result Verification Failed\n***" - RET=1 - fi fi set -e @@ -155,8 +413,6 @@ kill $SERVER_PID wait $SERVER_PID if [ $RET -eq 1 ]; then - cat $CLIENT_LOG - cat $SERVER_LOG echo -e "\n***\n*** BLS test FAILED. \n***" else echo -e "\n***\n*** BLS test PASSED. \n***" diff --git a/qa/L0_backend_python/common.sh b/qa/L0_backend_python/common.sh old mode 100644 new mode 100755 index 78d4998e2b..d66f99c75f --- a/qa/L0_backend_python/common.sh +++ b/qa/L0_backend_python/common.sh @@ -1,4 +1,5 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/bin/bash +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -31,7 +32,7 @@ get_shm_pages() { install_conda() { rm -rf ./miniconda - file_name="Miniconda3-py38_4.9.2-Linux-x86_64.sh" + file_name="Miniconda3-py310_23.3.1-0-Linux-x86_64.sh" wget https://repo.anaconda.com/miniconda/$file_name # install miniconda in silent mode @@ -43,21 +44,30 @@ install_conda() { install_build_deps() { apt update && apt install software-properties-common rapidjson-dev -y - wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ - apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ - apt-get update && \ - apt-get install -y --no-install-recommends \ - cmake-data=3.18.4-0kitware1ubuntu20.04.1 cmake=3.18.4-0kitware1ubuntu20.04.1 + # Using CMAKE installation instruction from:: https://apt.kitware.com/ + apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . /etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* } create_conda_env() { - python_version=$1 - env_name=$2 + local python_version=$1 + local env_name=$2 conda create -n $env_name python=$python_version -y conda activate $env_name - conda install conda-pack -y + conda install -c conda-forge conda-pack -y +} + +create_conda_env_with_specified_path() { + local python_version=$1 + local env_path=$2 + conda create -p $env_path python=$python_version -y + conda activate $env_path + conda install -c conda-forge conda-pack -y } create_python_backend_stub() { diff --git a/qa/L0_backend_python/custom_metrics/test.sh b/qa/L0_backend_python/custom_metrics/test.sh new file mode 100755 index 0000000000..9ba098f493 --- /dev/null +++ b/qa/L0_backend_python/custom_metrics/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +CLIENT_PY=../python_unittest.py +CLIENT_LOG="./custom_metrics_client.log" +EXPECTED_NUM_TESTS="1" +TEST_RESULT_FILE='test_results.txt' +source ../../common/util.sh + +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends +SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" +SERVER_LOG="./custom_metrics_server.log" + +RET=0 +rm -fr *.log ./models *.txt + +mkdir -p models/custom_metrics/1/ +cp ../../python_models/custom_metrics/model.py models/custom_metrics/1/ +cp ../../python_models/custom_metrics/config.pbtxt models/custom_metrics + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e + +export MODEL_NAME='custom_metrics' +python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'Custom Metrics' test FAILED. \n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + +set -e + +kill $SERVER_PID +wait $SERVER_PID + + +if [ $RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG + echo -e "\n***\n*** Custom Metrics test FAILED. \n***" +else + echo -e "\n***\n*** Custom Metrics test PASSED. \n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/decoupled/decoupled_test.py b/qa/L0_backend_python/decoupled/decoupled_test.py old mode 100644 new mode 100755 index 715860f3b0..1fc862fd5c --- a/qa/L0_backend_python/decoupled/decoupled_test.py +++ b/qa/L0_backend_python/decoupled/decoupled_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,21 +27,21 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../../common") -import test_util as tu -import tritonclient +import queue import time -import tritonclient.grpc as grpcclient -from tritonclient.utils import * -import numpy as np import unittest from functools import partial -import queue +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient +from tritonclient.utils import * -class UserData: +class UserData: def __init__(self): self._completed_requests = queue.Queue() @@ -52,10 +54,9 @@ def callback(user_data, result, error): class DecoupledTest(tu.TestResultCollector): - def test_decoupled_execute_error(self): # The decoupled_execute_error model returns an error for the first - # request and sucessfully processes the second request. This is making + # request and successfully processes the second request. 
This is making # sure that an error in a single request does not completely fail the # batch. @@ -63,8 +64,7 @@ def test_decoupled_execute_error(self): shape = [2, 2] number_of_requests = 2 user_data = UserData() - with grpcclient.InferenceServerClient( - "localhost:8001") as triton_client: + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: triton_client.start_stream(callback=partial(callback, user_data)) input_datas = [] @@ -72,12 +72,12 @@ def test_decoupled_execute_error(self): input_data = np.random.randn(*shape).astype(np.float32) input_datas.append(input_data) inputs = [ - grpcclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) - triton_client.async_stream_infer(model_name=model_name, - inputs=inputs) + triton_client.async_stream_infer(model_name=model_name, inputs=inputs) for i in range(number_of_requests): result = user_data._completed_requests.get() @@ -91,27 +91,28 @@ def test_decoupled_execute_error(self): self.assertTrue( np.array_equal(output_data, input_datas[i]), "error: expected output {} to match input {}".format( - output_data, input_datas[i])) + output_data, input_datas[i] + ), + ) def test_decoupled_bls(self): # Test combinations of BLS and decoupled API in Python backend. model_name = "decoupled_bls" shape = [1, 2] user_data = UserData() - with grpcclient.InferenceServerClient( - "localhost:8001") as triton_client: + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: triton_client.start_stream(callback=partial(callback, user_data)) input_datas = [] input_data = np.random.randn(*shape).astype(np.float32) input_datas.append(input_data) inputs = [ - grpcclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) - triton_client.async_stream_infer(model_name=model_name, - inputs=inputs) + triton_client.async_stream_infer(model_name=model_name, inputs=inputs) # Check the results of the decoupled model using BLS def check_result(result): @@ -123,11 +124,79 @@ def check_result(result): self.assertTrue( np.array_equal(output_data, input_data), "error: expected output {} to match input {}".format( - output_data, input_data)) + output_data, input_data + ), + ) result = user_data._completed_requests.get() check_result(result) + def test_decoupled_bls_stream(self): + # Test combinations of BLS and decoupled API in Python backend. + model_name = "decoupled_bls_stream" + in_values = [4, 2, 0, 1] + shape = [1] + user_data = UserData() + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: + triton_client.start_stream(callback=partial(callback, user_data)) + for i in range(len(in_values)): + input_data = np.array([in_values[i]], dtype=np.int32) + inputs = [ + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + triton_client.async_stream_infer( + model_name=model_name, inputs=inputs, request_id=str(i) + ) + + # Retrieve results... 
+ recv_count = 0 + expected_count = sum(in_values) + result_dict = {} + while recv_count < expected_count: + data_item = user_data._completed_requests.get() + self.assertIsNot(type(data_item), InferenceServerException) + + this_id = data_item.get_response().id + if this_id not in result_dict.keys(): + result_dict[this_id] = [] + result_dict[this_id].append((recv_count, data_item)) + + recv_count += 1 + # Validate results... + for i in range(len(in_values)): + this_id = str(i) + is_received = False + if this_id in result_dict.keys(): + is_received = True + + if in_values[i] != 0: + self.assertTrue( + is_received, + "response for request id {} not received".format(this_id), + ) + self.assertEqual(len(result_dict[this_id]), in_values[i]) + + result_list = result_dict[this_id] + expected_data = np.array([in_values[i]], dtype=np.int32) + for j in range(len(result_list)): + this_data = result_list[j][1].as_numpy("OUT") + self.assertTrue( + np.array_equal(expected_data, this_data), + "error: incorrect data: expected {}, got {}".format( + expected_data, this_data + ), + ) + else: + self.assertFalse( + is_received, + "received unexpected response for request id {}".format( + this_id + ), + ) + def test_decoupled_return_response_error(self): model_name = "decoupled_return_response_error" shape = [16] @@ -137,10 +206,12 @@ def test_decoupled_return_response_error(self): input_data_0 = np.random.random(shape).astype(np.float32) input_data_1 = np.random.random(shape).astype(np.float32) inputs = [ - grpcclient.InferInput("INPUT0", input_data_0.shape, - np_to_triton_dtype(input_data_0.dtype)), - grpcclient.InferInput("INPUT1", input_data_1.shape, - np_to_triton_dtype(input_data_1.dtype)) + grpcclient.InferInput( + "INPUT0", input_data_0.shape, np_to_triton_dtype(input_data_0.dtype) + ), + grpcclient.InferInput( + "INPUT1", input_data_1.shape, np_to_triton_dtype(input_data_1.dtype) + ), ] inputs[0].set_data_from_numpy(input_data_0) inputs[1].set_data_from_numpy(input_data_1) @@ -149,9 +220,11 @@ def test_decoupled_return_response_error(self): if type(data_item) == InferenceServerException: self.assertEqual( data_item.message(), - "Python model 'decoupled_return_response_error_0' is using " + "Python model 'decoupled_return_response_error_0_0' is using " "the decoupled mode and the execute function must return " - "None.", "Exception message didn't match.") + "None.", + "Exception message didn't match.", + ) def test_decoupled_send_after_close_error(self): model_name = "decoupled_send_after_close_error" @@ -162,10 +235,12 @@ def test_decoupled_send_after_close_error(self): input_data_0 = np.random.random(shape).astype(np.float32) input_data_1 = np.random.random(shape).astype(np.float32) inputs = [ - grpcclient.InferInput("INPUT0", input_data_0.shape, - np_to_triton_dtype(input_data_0.dtype)), - grpcclient.InferInput("INPUT1", input_data_1.shape, - np_to_triton_dtype(input_data_1.dtype)) + grpcclient.InferInput( + "INPUT0", input_data_0.shape, np_to_triton_dtype(input_data_0.dtype) + ), + grpcclient.InferInput( + "INPUT1", input_data_1.shape, np_to_triton_dtype(input_data_1.dtype) + ), ] inputs[0].set_data_from_numpy(input_data_0) inputs[1].set_data_from_numpy(input_data_1) @@ -175,9 +250,75 @@ def test_decoupled_send_after_close_error(self): # way to deliver the error message to the client. The error # will be logged on the server side. 
time.sleep(4) - self.assertEqual(user_data._completed_requests.qsize(), 0, - "The completed request size must be zero.") + self.assertEqual( + user_data._completed_requests.qsize(), + 0, + "The completed request size must be zero.", + ) + + def test_decoupled_execute_cancel(self): + model_name = "execute_cancel" + log_path = "decoupled_server.log" + execute_delay = 4.0 # seconds + shape = [1, 1] + + user_data = UserData() + with grpcclient.InferenceServerClient("localhost:8001") as client: + client.start_stream(callback=partial(callback, user_data)) + input_data = np.array([[execute_delay]], dtype=np.float32) + inputs = [ + grpcclient.InferInput( + "EXECUTE_DELAY", shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + client.async_stream_infer(model_name, inputs) + time.sleep(2) # model delay for decoupling request and response sender + time.sleep(2) # ensure the request is executing + client.stop_stream(cancel_requests=True) + time.sleep(2) # ensure the cancellation is delivered + + self.assertFalse(user_data._completed_requests.empty()) + while not user_data._completed_requests.empty(): + data_item = user_data._completed_requests.get() + self.assertIsInstance(data_item, InferenceServerException) + self.assertEqual(data_item.status(), "StatusCode.CANCELLED") + + with open(log_path, mode="r", encoding="utf-8", errors="strict") as f: + log_text = f.read() + self.assertIn("[execute_cancel] Request not cancelled at 1.0 s", log_text) + self.assertIn("[execute_cancel] Request cancelled at ", log_text) + + def test_decoupled_raise_exception(self): + # The decoupled_raise_exception model raises an exception for the request. + # This test case is making sure that repeated exceptions are properly handled. + + model_name = "decoupled_raise_exception" + shape = [2, 2] + number_of_requests = 10 + user_data = UserData() + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: + triton_client.start_stream(callback=partial(callback, user_data)) + + input_datas = [] + for i in range(number_of_requests): + input_data = np.random.randn(*shape).astype(np.float32) + input_datas.append(input_data) + inputs = [ + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + triton_client.async_stream_infer(model_name=model_name, inputs=inputs) + + for i in range(number_of_requests): + result = user_data._completed_requests.get() + self.assertIs(type(result), InferenceServerException) + self.assertIn("Intentional Error", result.message()) + + self.assertTrue(triton_client.is_model_ready(model_name)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/decoupled/models/decoupled_bls/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_bls/1/model.py index 56f79f99e6..782e7ec86e 100644 --- a/qa/L0_backend_python/decoupled/models/decoupled_bls/1/model.py +++ b/qa/L0_backend_python/decoupled/models/decoupled_bls/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,78 +24,102 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
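[Editor's note on the decoupled client tests in decoupled_test.py above: they all share one gRPC streaming pattern: register a callback that pushes results or errors onto a queue, send requests with async_stream_infer, then drain the queue. The sketch below is illustrative only; "my_decoupled_model" is a placeholder model name, the "IN" tensor mirrors the test models, and the callback body is paraphrased since the hunk shows only its signature.]

    import queue
    from functools import partial

    import numpy as np
    import tritonclient.grpc as grpcclient
    from tritonclient.utils import np_to_triton_dtype


    class UserData:
        def __init__(self):
            # Holds every streamed response (or error) delivered by the callback.
            self._completed_requests = queue.Queue()


    def callback(user_data, result, error):
        # Errors and results share one queue; callers inspect the item type later.
        user_data._completed_requests.put(error if error is not None else result)


    user_data = UserData()
    with grpcclient.InferenceServerClient("localhost:8001") as client:
        client.start_stream(callback=partial(callback, user_data))
        input_data = np.array([1], dtype=np.int32)
        inputs = [
            grpcclient.InferInput(
                "IN", input_data.shape, np_to_triton_dtype(input_data.dtype)
            )
        ]
        inputs[0].set_data_from_numpy(input_data)
        client.async_stream_infer(model_name="my_decoupled_model", inputs=inputs)
        # A decoupled model may answer with zero, one, or many responses; each one
        # (or an InferenceServerException) arrives on the queue via the callback.
        first_item = user_data._completed_requests.get()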
-import triton_python_backend_utils as pb_utils import json +import sys import threading import time + import numpy as np -import asyncio import torch +import triton_python_backend_utils as pb_utils from torch.utils.dlpack import from_dlpack, to_dlpack -import sys class TritonPythonModel: - """ This model sends an error message with the first request. - """ + """This model sends an error message with the first request.""" def initialize(self, args): + logger = pb_utils.Logger + logger.log("Initialize-Specific Msg!", logger.INFO) + logger.log_info("Initialize-Info Msg!") + logger.log_warn("Initialize-Warning Msg!") + logger.log_error("Initialize-Error Msg!") # You must parse model_config. JSON string is not parsed here - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) using_decoupled = pb_utils.using_decoupled_model_transaction_policy( - model_config) + model_config + ) if not using_decoupled: raise pb_utils.TritonModelException( """the model `{}` can generate any number of responses per request, enable decoupled transaction policy in model configuration to - serve this model""".format(args['model_name'])) + serve this model""".format( + args["model_name"] + ) + ) # Get OUT configuration out_config = pb_utils.get_output_config_by_name(model_config, "OUT") # Convert Triton types to numpy types - self.out_dtype = pb_utils.triton_string_to_numpy( - out_config['data_type']) + self.out_dtype = pb_utils.triton_string_to_numpy(out_config["data_type"]) self.inflight_thread_count = 0 self.inflight_thread_count_lck = threading.Lock() + logger = pb_utils.Logger + logger.log("Initialize-Specific Msg!", logger.INFO) + logger.log_info("Initialize-Info Msg!") + logger.log_warn("Initialize-Warning Msg!") + logger.log_error("Initialize-Error Msg!") def execute(self, requests): - """ This function is called on inference request. - """ - + """This function is called on inference request.""" + logger = pb_utils.Logger + logger.log("Execute-Specific Msg!", logger.INFO) + logger.log_info("Execute-Info Msg!") + logger.log_warn("Execute-Warning Msg!") + logger.log_error("Execute-Error Msg!") # Only generate the error for the first request for i, request in enumerate(requests): - request_input = pb_utils.get_input_tensor_by_name(request, 'IN') + request_input = pb_utils.get_input_tensor_by_name(request, "IN") # Sync BLS request infer_request = pb_utils.InferenceRequest( - model_name='identity_fp32', + model_name="identity_fp32", requested_output_names=["OUTPUT0"], - inputs=[pb_utils.Tensor('INPUT0', request_input.as_numpy())]) + inputs=[pb_utils.Tensor("INPUT0", request_input.as_numpy())], + ) infer_response = infer_request.exec() if infer_response.has_error(): raise pb_utils.TritonModelException( f"BLS Response has an error: {infer_response.error().message()}" ) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, "OUTPUT0") + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") if np.any(output0.as_numpy() != request_input.as_numpy()): raise pb_utils.TritonModelException( f"BLS Request input and BLS response output do not match. 
{request_input.as_numpy()} != {output0.as_numpy()}" ) - thread1 = threading.Thread(target=self.response_thread, - args=(request.get_response_sender(), - pb_utils.get_input_tensor_by_name( - request, 'IN').as_numpy())) + thread1 = threading.Thread( + target=self.response_thread, + args=( + request.get_response_sender(), + pb_utils.get_input_tensor_by_name(request, "IN").as_numpy(), + ), + ) thread1.daemon = True with self.inflight_thread_count_lck: self.inflight_thread_count += 1 thread1.start() + logger = pb_utils.Logger + logger.log("Execute-Specific Msg!", logger.INFO) + logger.log_info("Execute-Info Msg!") + logger.log_warn("Execute-Warning Msg!") + logger.log_error("Execute-Error Msg!") + return None def _get_gpu_bls_outputs(self, input0_pb, input1_pb): @@ -105,16 +129,23 @@ def _get_gpu_bls_outputs(self, input0_pb, input1_pb): Returns True on success and False on failure. """ + logger = pb_utils.Logger + logger.log("_get_gpu_bls_outputs-Specific Msg!", logger.INFO) + logger.log_info("_get_gpu_bls_outputs-Info Msg!") + logger.log_warn("_get_gpu_bls_outputs-Warning Msg!") + logger.log_error("_get_gpu_bls_outputs-Error Msg!") + infer_request = pb_utils.InferenceRequest( - model_name='dlpack_add_sub', + model_name="dlpack_add_sub", inputs=[input0_pb, input1_pb], - requested_output_names=['OUTPUT0', 'OUTPUT1']) + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) infer_response = infer_request.exec() if infer_response.has_error(): return False - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') - output1 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT1') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") if output0 is None or output1 is None: return False @@ -158,44 +189,56 @@ def _get_gpu_bls_outputs(self, input0_pb, input1_pb): return output0.to_dlpack(), output1.to_dlpack() def _test_gpu_bls_add_sub(self, is_input0_gpu, is_input1_gpu): + logger = pb_utils.Logger + logger.log("_test_gpu_bls_add_sub-Specific Msg!", logger.INFO) + logger.log_info("_test_gpu_bls_add_sub-Info Msg!") + logger.log_warn("_test_gpu_bls_add_sub-Warning Msg!") + logger.log_error("_test_gpu_bls_add_sub-Error Msg!") + input0 = torch.rand(16) input1 = torch.rand(16) if is_input0_gpu: - input0 = input0.to('cuda') + input0 = input0.to("cuda") if is_input1_gpu: - input1 = input1.to('cuda') + input1 = input1.to("cuda") - input0_pb = pb_utils.Tensor.from_dlpack('INPUT0', to_dlpack(input0)) - input1_pb = pb_utils.Tensor.from_dlpack('INPUT1', to_dlpack(input1)) + input0_pb = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(input0)) + input1_pb = pb_utils.Tensor.from_dlpack("INPUT1", to_dlpack(input1)) gpu_bls_return = self._get_gpu_bls_outputs(input0_pb, input1_pb) if gpu_bls_return: output0_dlpack, output1_dlpack = gpu_bls_return else: return False - expected_output_0 = from_dlpack( - input0_pb.to_dlpack()).to('cpu') + from_dlpack( - input1_pb.to_dlpack()).to('cpu') - expected_output_1 = from_dlpack( - input0_pb.to_dlpack()).to('cpu') - from_dlpack( - input1_pb.to_dlpack()).to('cpu') + expected_output_0 = from_dlpack(input0_pb.to_dlpack()).to("cpu") + from_dlpack( + input1_pb.to_dlpack() + ).to("cpu") + expected_output_1 = from_dlpack(input0_pb.to_dlpack()).to("cpu") - from_dlpack( + input1_pb.to_dlpack() + ).to("cpu") output0_matches = torch.all( - expected_output_0 == from_dlpack(output0_dlpack).to('cpu')) + expected_output_0 == from_dlpack(output0_dlpack).to("cpu") + ) output1_matches = 
torch.all( - expected_output_1 == from_dlpack(output1_dlpack).to('cpu')) + expected_output_1 == from_dlpack(output1_dlpack).to("cpu") + ) if not output0_matches or not output1_matches: return False return True def execute_gpu_bls(self): + logger = pb_utils.Logger + logger.log("execute_gpu_bls-Specific Msg!", logger.INFO) + logger.log_info("execute_gpu_bls-Info Msg!") + logger.log_warn("execute_gpu_bls-Warning Msg!") + logger.log_error("execute_gpu_bls-Error Msg!") for input0_device in [True, False]: for input1_device in [True, False]: - test_status = self._test_gpu_bls_add_sub( - input0_device, input1_device) + test_status = self._test_gpu_bls_add_sub(input0_device, input1_device) if not test_status: return False @@ -205,59 +248,69 @@ def response_thread(self, response_sender, in_input): # The response_sender is used to send response(s) associated with the # corresponding request. # Sleep 5 seconds to make sure the main thread has exited. + logger = pb_utils.Logger + logger.log("response_thread-Specific Msg!", logger.INFO) + logger.log_info("response_thread-Info Msg!") + logger.log_warn("response_thread-Warning Msg!") + logger.log_error("response_thread-Error Msg!") time.sleep(5) status = self.execute_gpu_bls() if not status: - infer_response = pb_utils.InferenceResponse( - error="GPU BLS test failed.") + infer_response = pb_utils.InferenceResponse(error="GPU BLS test failed.") response_sender.send(infer_response) else: in_value = in_input infer_request = pb_utils.InferenceRequest( - model_name='identity_fp32', + model_name="identity_fp32", requested_output_names=["OUTPUT0"], - inputs=[pb_utils.Tensor('INPUT0', in_input)]) + inputs=[pb_utils.Tensor("INPUT0", in_input)], + ) infer_response = infer_request.exec() - output0 = pb_utils.get_output_tensor_by_name( - infer_response, "OUTPUT0") + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") if infer_response.has_error(): response = pb_utils.InferenceResponse( - error=infer_response.error().message()) + error=infer_response.error().message() + ) response_sender.send( - response, - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) elif np.any(in_input != output0.as_numpy()): error_message = ( "BLS Request input and BLS response output do not match." - f" {in_value} != {output0.as_numpy()}") + f" {in_value} != {output0.as_numpy()}" + ) response = pb_utils.InferenceResponse(error=error_message) response_sender.send( - response, - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) else: - output_tensors = [pb_utils.Tensor('OUT', in_value)] - response = pb_utils.InferenceResponse( - output_tensors=output_tensors) + output_tensors = [pb_utils.Tensor("OUT", in_value)] + response = pb_utils.InferenceResponse(output_tensors=output_tensors) response_sender.send( - response, - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) with self.inflight_thread_count_lck: self.inflight_thread_count -= 1 + logger.log("response_thread-Specific Msg!", logger.INFO) + logger.log_info("response_thread-Info Msg!") + logger.log_warn("response_thread-Warning Msg!") + logger.log_error("response_thread-Error Msg!") def finalize(self): """`finalize` is called only once when the model is being unloaded. Implementing `finalize` function is OPTIONAL. This function allows the model to perform any necessary clean ups before exit. 
""" - print('Finalize invoked') + logger = pb_utils.Logger + logger.log_info("Finalize invoked") inflight_threads = True while inflight_threads: with self.inflight_thread_count_lck: - inflight_threads = (self.inflight_thread_count != 0) + inflight_threads = self.inflight_thread_count != 0 if inflight_threads: time.sleep(0.1) - print('Finalize complete...') + logger.log_info("Finalize complete...") diff --git a/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/1/model.py new file mode 100644 index 0000000000..8643482912 --- /dev/null +++ b/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/1/model.py @@ -0,0 +1,132 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json +import threading +import time + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + """This model sends a BLS request to a decoupled model 'square_int32' and + returns the output from 'square_int32' as responses. + """ + + def initialize(self, args): + # You must parse model_config. 
JSON string is not parsed here + self.model_config = model_config = json.loads(args["model_config"]) + + using_decoupled = pb_utils.using_decoupled_model_transaction_policy( + model_config + ) + if not using_decoupled: + raise pb_utils.TritonModelException( + """the model `{}` can generate any number of responses per request, + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) + + self.inflight_thread_count = 0 + self.inflight_thread_count_lck = threading.Lock() + + def execute(self, requests): + """This function is called on inference request.""" + + for request in requests: + thread = threading.Thread( + target=self.response_thread, + args=( + request.get_response_sender(), + pb_utils.get_input_tensor_by_name(request, "IN").as_numpy(), + ), + ) + thread.daemon = True + with self.inflight_thread_count_lck: + self.inflight_thread_count += 1 + thread.start() + + return None + + def response_thread(self, response_sender, in_value): + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", + requested_output_names=["OUT"], + inputs=[pb_utils.Tensor("IN", in_value)], + ) + infer_responses = infer_request.exec(decoupled=True) + + response_count = 0 + for infer_response in infer_responses: + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + if infer_response.has_error(): + response = pb_utils.InferenceResponse( + error=infer_response.error().message() + ) + response_sender.send( + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + elif np.any(in_value != output0.as_numpy()): + error_message = ( + "BLS Request input and BLS response output do not match." + f" {in_value} != {output0.as_numpy()}" + ) + response = pb_utils.InferenceResponse(error=error_message) + response_sender.send( + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + else: + output_tensors = [pb_utils.Tensor("OUT", output0.as_numpy())] + response = pb_utils.InferenceResponse(output_tensors=output_tensors) + response_sender.send(response) + + response_count += 1 + + if in_value != response_count - 1: + error_message = "Expected {} responses, got {}".format( + in_value, len(infer_responses) - 1 + ) + response = pb_utils.InferenceResponse(error=error_message) + response_sender.send( + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + else: + response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + + with self.inflight_thread_count_lck: + self.inflight_thread_count -= 1 + + def finalize(self): + inflight_threads = True + while inflight_threads: + with self.inflight_thread_count_lck: + inflight_threads = self.inflight_thread_count != 0 + if inflight_threads: + time.sleep(0.1) diff --git a/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/config.pbtxt b/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/config.pbtxt new file mode 100644 index 0000000000..23ad453212 --- /dev/null +++ b/qa/L0_backend_python/decoupled/models/decoupled_bls_stream/config.pbtxt @@ -0,0 +1,54 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "decoupled_bls_stream" +backend: "python" + +model_transaction_policy { + decoupled: True +} +input [ + { + name: "IN" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUT" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] diff --git a/qa/L0_backend_python/decoupled/models/decoupled_execute_error/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_execute_error/1/model.py index 1a7bd7abed..3882f0da9c 100644 --- a/qa/L0_backend_python/decoupled/models/decoupled_execute_error/1/model.py +++ b/qa/L0_backend_python/decoupled/models/decoupled_execute_error/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,49 +24,55 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils import json import threading import time +import triton_python_backend_utils as pb_utils + class TritonPythonModel: - """ This model sends an error message with the first request. - """ + """This model sends an error message with the first request.""" def initialize(self, args): # You must parse model_config. 
JSON string is not parsed here - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) using_decoupled = pb_utils.using_decoupled_model_transaction_policy( - model_config) + model_config + ) if not using_decoupled: raise pb_utils.TritonModelException( """the model `{}` can generate any number of responses per request, enable decoupled transaction policy in model configuration to - serve this model""".format(args['model_name'])) + serve this model""".format( + args["model_name"] + ) + ) # Get OUT configuration out_config = pb_utils.get_output_config_by_name(model_config, "OUT") # Convert Triton types to numpy types - self.out_dtype = pb_utils.triton_string_to_numpy( - out_config['data_type']) + self.out_dtype = pb_utils.triton_string_to_numpy(out_config["data_type"]) self.inflight_thread_count = 0 self.inflight_thread_count_lck = threading.Lock() def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" # Only generate the error for the first request for i, request in enumerate(requests): # Start a separate thread to send the responses for the request. - thread = threading.Thread(target=self.response_thread, - args=(request.get_response_sender(), i, - pb_utils.get_input_tensor_by_name( - request, 'IN').as_numpy())) + thread = threading.Thread( + target=self.response_thread, + args=( + request.get_response_sender(), + i, + pb_utils.get_input_tensor_by_name(request, "IN").as_numpy(), + ), + ) thread.daemon = True with self.inflight_thread_count_lck: @@ -86,9 +92,10 @@ def response_thread(self, response_sender, index, in_input): out_output = pb_utils.Tensor("OUT", in_value) if index == 0: - error = pb_utils.TritonError('An error occured during execution') - response = pb_utils.InferenceResponse(output_tensors=[out_output], - error=error) + error = pb_utils.TritonError("An error occurred during execution") + response = pb_utils.InferenceResponse( + output_tensors=[out_output], error=error + ) else: response = pb_utils.InferenceResponse(output_tensors=[out_output]) response_sender.send(response) @@ -96,8 +103,7 @@ def response_thread(self, response_sender, index, in_input): # We must close the response sender to indicate to Triton that we are # done sending responses for the corresponding request. We can't use the # response sender after closing it. - response_sender.send( - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) with self.inflight_thread_count_lck: self.inflight_thread_count -= 1 @@ -107,13 +113,13 @@ def finalize(self): Implementing `finalize` function is OPTIONAL. This function allows the model to perform any necessary clean ups before exit. 
""" - print('Finalize invoked') + print("Finalize invoked") inflight_threads = True while inflight_threads: with self.inflight_thread_count_lck: - inflight_threads = (self.inflight_thread_count != 0) + inflight_threads = self.inflight_thread_count != 0 if inflight_threads: time.sleep(0.1) - print('Finalize complete...') + print("Finalize complete...") diff --git a/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/1/model.py new file mode 100644 index 0000000000..03a19db98d --- /dev/null +++ b/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/1/model.py @@ -0,0 +1,35 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +class TritonPythonModel: + def initialize(self, args): + pass + + def execute(self, requests): + for request in requests: + raise Exception("Intentional Error") + return None diff --git a/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/config.pbtxt b/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/config.pbtxt new file mode 100644 index 0000000000..046687dfe7 --- /dev/null +++ b/qa/L0_backend_python/decoupled/models/decoupled_raise_exception/config.pbtxt @@ -0,0 +1,55 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "decoupled_raise_exception" +backend: "python" +max_batch_size: 64 + +model_transaction_policy { + decoupled: True +} +input [ + { + name: "IN" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUT" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] diff --git a/qa/L0_backend_python/decoupled/models/decoupled_return_response_error/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_return_response_error/1/model.py index 959fb0fcae..ecde9c7168 100644 --- a/qa/L0_backend_python/decoupled/models/decoupled_return_response_error/1/model.py +++ b/qa/L0_backend_python/decoupled/models/decoupled_return_response_error/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,39 +24,43 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + import triton_python_backend_utils as pb_utils class TritonPythonModel: - """ This model tries to return a response directly from + """This model tries to return a response directly from execute function when configured as decoupled model. 
""" def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) using_decoupled = pb_utils.using_decoupled_model_transaction_policy( - model_config) + model_config + ) if not using_decoupled: raise pb_utils.TritonModelException( """the model `{}` can generate any number of responses per request, - enable decoupled transaction policy in model configuration to - serve this model""".format(args['model_name'])) + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ Tries to create a response sender object and use that + """Tries to create a response sender object and use that for sending the response. """ @@ -67,13 +71,12 @@ def execute(self, requests): for request in requests: in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/L0_backend_python/decoupled/models/decoupled_send_after_close_error/1/model.py b/qa/L0_backend_python/decoupled/models/decoupled_send_after_close_error/1/model.py index 296269bb27..52aa17ac0d 100644 --- a/qa/L0_backend_python/decoupled/models/decoupled_send_after_close_error/1/model.py +++ b/qa/L0_backend_python/decoupled/models/decoupled_send_after_close_error/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,46 +24,51 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + import triton_python_backend_utils as pb_utils class TritonPythonModel: - """ This model tries to send response after closing + """This model tries to send response after closing the response_sender. 
""" def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) using_decoupled = pb_utils.using_decoupled_model_transaction_policy( - model_config) + model_config + ) if not using_decoupled: raise pb_utils.TritonModelException( """the model `{}` can generate any number of responses per request, - enable decoupled transaction policy in model configuration to - serve this model""".format(args['model_name'])) + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ Create a response sender object and use that + """Create a response sender object and use that for sending the response. """ # This model does not support batching, so 'request_count' should always be 1. if len(requests) != 1: - raise pb_utils.TritonModelException("unsupported batch size " + - len(requests)) + raise pb_utils.TritonModelException( + "unsupported batch size " + len(requests) + ) output0_dtype = self.output0_dtype output1_dtype = self.output1_dtype @@ -71,13 +76,14 @@ def execute(self, requests): response_sender = requests[0].get_response_sender() in_0 = pb_utils.get_input_tensor_by_name(requests[0], "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(requests[0], "INPUT1") - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) response = pb_utils.InferenceResponse([out_tensor_0, out_tensor_1]) - response_sender.send( - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) response_sender.send(response) diff --git a/qa/L0_backend_python/decoupled/test.sh b/qa/L0_backend_python/decoupled/test.sh old mode 100644 new mode 100755 index 5c73af6c4a..db8d4625f1 --- a/qa/L0_backend_python/decoupled/test.sh +++ b/qa/L0_backend_python/decoupled/test.sh @@ -1,4 +1,5 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,14 +26,17 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
CLIENT_PY=./decoupled_test.py -CLIENT_LOG="./client.log" -EXPECTED_NUM_TESTS="4" +CLIENT_LOG="./decoupled_client.log" +EXPECTED_NUM_TESTS="7" TEST_RESULT_FILE='test_results.txt' TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./decoupled_server.log" + +pip3 uninstall -y torch +pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html RET=0 source ../../common/util.sh @@ -46,6 +50,43 @@ mkdir -p models/dlpack_add_sub/1/ cp ../../python_models/dlpack_add_sub/model.py models/dlpack_add_sub/1/ cp ../../python_models/dlpack_add_sub/config.pbtxt models/dlpack_add_sub/ +mkdir -p models/execute_cancel/1/ +cp ../../python_models/execute_cancel/model.py ./models/execute_cancel/1/ +cp ../../python_models/execute_cancel/config.pbtxt ./models/execute_cancel/ +echo "model_transaction_policy { decoupled: True }" >> ./models/execute_cancel/config.pbtxt + +git clone https://github.com/triton-inference-server/python_backend -b $PYTHON_BACKEND_REPO_TAG +mkdir -p models/square_int32/1/ +cp python_backend/examples/decoupled/square_model.py models/square_int32/1/model.py +cp python_backend/examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt + +function verify_log_counts () { + if [ `grep -c "Specific Msg!" $SERVER_LOG` -lt 1 ]; then + echo -e "\n***\n*** Test Failed: Specific Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Info Msg!" $SERVER_LOG` -lt 1 ]; then + echo -e "\n***\n*** Test Failed: Info Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Warning Msg!" $SERVER_LOG` -lt 1 ]; then + echo -e "\n***\n*** Test Failed: Warning Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Error Msg!" $SERVER_LOG` -lt 1 ]; then + echo -e "\n***\n*** Test Failed: Error Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Finalize invoked" $SERVER_LOG` -ne 3 ]; then + echo -e "\n***\n*** Test Failed: 'Finalize invoked' message missing\n***" + RET=1 + fi + if [ `grep -c "Finalize complete..." $SERVER_LOG` -ne 3 ]; then + echo -e "\n***\n*** Test Failed: 'Finalize complete...' message missing\n***" + RET=1 + fi +} + run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -72,6 +113,8 @@ set -e kill $SERVER_PID wait $SERVER_PID +verify_log_counts + if [ $RET -eq 1 ]; then cat $CLIENT_LOG cat $SERVER_LOG diff --git a/qa/L0_backend_python/ensemble/ensemble_test.py b/qa/L0_backend_python/ensemble/ensemble_test.py old mode 100644 new mode 100755 index 831f1fa5a3..9fb60e5a4e --- a/qa/L0_backend_python/ensemble/ensemble_test.py +++ b/qa/L0_backend_python/ensemble/ensemble_test.py @@ -1,4 +1,6 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,23 +27,23 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../../common") -import test_util as tu +import unittest + +import numpy as np import shm_util +import test_util as tu import tritonclient.http as httpclient from tritonclient.utils import * -import numpy as np -import unittest class EnsembleTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() - def test_ensemble(self): - model_name = "ensemble" + def infer(self, model_name): shape = [16] with self._shm_leak_detector.Probe() as shm_probe: with httpclient.InferenceServerClient("localhost:8000") as client: @@ -49,47 +51,37 @@ def test_ensemble(self): input_data_1 = np.random.random(shape).astype(np.float32) inputs = [ httpclient.InferInput( - "INPUT0", input_data_0.shape, - np_to_triton_dtype(input_data_0.dtype)), + "INPUT0", + input_data_0.shape, + np_to_triton_dtype(input_data_0.dtype), + ), httpclient.InferInput( - "INPUT1", input_data_1.shape, - np_to_triton_dtype(input_data_1.dtype)) + "INPUT1", + input_data_1.shape, + np_to_triton_dtype(input_data_1.dtype), + ), ] inputs[0].set_data_from_numpy(input_data_0) inputs[1].set_data_from_numpy(input_data_1) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') - output1 = result.as_numpy('OUTPUT1') + output0 = result.as_numpy("OUTPUT0") + output1 = result.as_numpy("OUTPUT1") self.assertIsNotNone(output0) self.assertIsNotNone(output1) - self.assertTrue(np.allclose(output0, 2 * input_data_0)) - self.assertTrue(np.allclose(output1, 2 * input_data_1)) + # Set a big enough tolerance to reduce intermittence. May be + # better to test integer outputs in the future for consistency. + self.assertTrue(np.allclose(output0, 2 * input_data_0, atol=1e-06)) + self.assertTrue(np.allclose(output1, 2 * input_data_1, atol=1e-06)) - model_name = "ensemble_gpu" - with self._shm_leak_detector.Probe() as shm_probe: - with httpclient.InferenceServerClient("localhost:8000") as client: - input_data_0 = np.random.random(shape).astype(np.float32) - input_data_1 = np.random.random(shape).astype(np.float32) - inputs = [ - httpclient.InferInput( - "INPUT0", input_data_0.shape, - np_to_triton_dtype(input_data_0.dtype)), - httpclient.InferInput( - "INPUT1", input_data_1.shape, - np_to_triton_dtype(input_data_1.dtype)) - ] - inputs[0].set_data_from_numpy(input_data_0) - inputs[1].set_data_from_numpy(input_data_1) - result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') - output1 = result.as_numpy('OUTPUT1') - self.assertIsNotNone(output0) - self.assertIsNotNone(output1) + def test_ensemble(self): + model_name = "ensemble" + self.infer(model_name) - self.assertTrue(np.allclose(output0, 2 * input_data_0)) - self.assertTrue(np.allclose(output1, 2 * input_data_1)) + def test_ensemble_gpu(self): + model_name = "ensemble_gpu" + self.infer(model_name) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/ensemble/test.sh b/qa/L0_backend_python/ensemble/test.sh old mode 100644 new mode 100755 index cd1018733b..c9292c4f4a --- a/qa/L0_backend_python/ensemble/test.sh +++ b/qa/L0_backend_python/ensemble/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,9 +25,8 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_PY=./lifecycle_test.py -CLIENT_LOG="./client.log" -EXPECTED_NUM_TESTS="1" +CLIENT_LOG="./ensemble_client.log" +EXPECTED_NUM_TESTS="2" TEST_RESULT_FILE='test_results.txt' source ../common.sh source ../../common/util.sh @@ -36,7 +35,7 @@ TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./ensemble_server.log" RET=0 rm -rf models/ $CLIENT_LOG @@ -47,14 +46,10 @@ cp ../../python_models/ensemble/config.pbtxt ./models/ensemble mkdir -p models/add_sub_1/1/ cp ../../python_models/add_sub/config.pbtxt ./models/add_sub_1 -(cd models/add_sub_1 && \ - sed -i "s/^name:.*/name: \"add_sub_1\"/" config.pbtxt) cp ../../python_models/add_sub/model.py ./models/add_sub_1/1/ mkdir -p models/add_sub_2/1/ cp ../../python_models/add_sub/config.pbtxt ./models/add_sub_2/ -(cd models/add_sub_2 && \ - sed -i "s/^name:.*/name: \"add_sub_2\"/" config.pbtxt) cp ../../python_models/add_sub/model.py ./models/add_sub_2/1/ # Ensemble GPU Model diff --git a/qa/L0_backend_python/env/test.sh b/qa/L0_backend_python/env/test.sh old mode 100644 new mode 100755 index 361635a9c4..e1106f8e79 --- a/qa/L0_backend_python/env/test.sh +++ b/qa/L0_backend_python/env/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,15 +25,15 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_LOG="./client.log" +CLIENT_LOG="./env_client.log" source ../common.sh source ../../common/util.sh SERVER=/opt/tritonserver/bin/tritonserver -BASE_SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 --strict-model-config=false" +BASE_SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 --disable-auto-complete-config" PYTHON_BACKEND_BRANCH=$PYTHON_BACKEND_REPO_TAG SERVER_ARGS=$BASE_SERVER_ARGS -SERVER_LOG="./inference_server.log" +SERVER_LOG="./env_server.log" RET=0 @@ -48,6 +48,9 @@ install_conda create_conda_env "3.7" "python-3-7" conda install numpy=1.20.1 -y conda install tensorflow=2.1.0 -y +conda install -c conda-forge libstdcxx-ng=12 -y + +PY37_VERSION_STRING="Python version is 3.7, NumPy version is 1.20.1, and Tensorflow version is 2.1.0" create_python_backend_stub conda-pack -o python3.7.tar.gz path_to_conda_pack=`pwd`/python3.7.tar.gz @@ -60,12 +63,38 @@ cp ../../python_models/python_version/model.py ./models/python_3_7/1/ cp python_backend/builddir/triton_python_backend_stub ./models/python_3_7 conda deactivate +# Use python-3-7 without conda pack +# Create a model with python 3.7 version and numpy 1.20.3 to distinguish from +# previous test. +# Tensorflow 2.1.0 only works with Python 3.4 - 3.7. Successful execution of +# the Python model indicates that the environment has been setup correctly. 
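+# Note: unlike the conda-pack model above, EXECUTION_ENV_PATH for this model
+# points at the unpacked environment directory itself rather than a packed
+# archive, so no tarball is built and the conda activate script is copied
+# into the environment's bin directory manually below.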
+path_to_conda_pack="$PWD/python-3-7-1" +create_conda_env_with_specified_path "3.7" $path_to_conda_pack +conda install numpy=1.20.3 -y +conda install tensorflow=2.1.0 -y +conda install -c conda-forge libstdcxx-ng=12 -y + +PY37_1_VERSION_STRING="Python version is 3.7, NumPy version is 1.20.3, and Tensorflow version is 2.1.0" +create_python_backend_stub +mkdir -p models/python_3_7_1/1/ +cp ../../python_models/python_version/config.pbtxt ./models/python_3_7_1 +(cd models/python_3_7_1 && \ + sed -i "s/^name:.*/name: \"python_3_7_1\"/" config.pbtxt && \ + echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}">> config.pbtxt) +cp ../../python_models/python_version/model.py ./models/python_3_7_1/1/ +# Copy activate script to folder +cp $path_to_conda_pack/lib/python3.7/site-packages/conda_pack/scripts/posix/activate $path_to_conda_pack/bin/. +cp python_backend/builddir/triton_python_backend_stub ./models/python_3_7_1 +conda deactivate + # Create a model with python 3.6 version # Tensorflow 2.1.0 only works with Python 3.4 - 3.7. Successful execution of # the Python model indicates that the environment has been setup correctly. create_conda_env "3.6" "python-3-6" +conda install -c conda-forge libstdcxx-ng=12 -y conda install numpy=1.18.1 -y conda install tensorflow=2.1.0 -y +PY36_VERSION_STRING="Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0" conda-pack -o python3.6.tar.gz # Test relative execution env path @@ -79,21 +108,26 @@ cp python3.6.tar.gz models/python_3_6/python_3_6_environment.tar.gz echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}" >> config.pbtxt) cp ../../python_models/python_version/model.py ./models/python_3_6/1/ cp python_backend/builddir/triton_python_backend_stub ./models/python_3_6 +conda deactivate -# Test conda env without custom Python backend stub -# Tensorflow 2.3.0 only works with Python 3.5 - 3.8. -path_to_conda_pack='$$TRITON_MODEL_DIRECTORY/python_3_8_environment.tar.gz' -create_conda_env "3.8" "python-3-8" -conda install numpy=1.19.1 -y -conda install tensorflow=2.3.0 -y -conda-pack -o python3.8.tar.gz -mkdir -p models/python_3_8/1/ -cp ../../python_models/python_version/config.pbtxt ./models/python_3_8 -cp python3.8.tar.gz models/python_3_8/python_3_8_environment.tar.gz -(cd models/python_3_8 && \ - sed -i "s/^name:.*/name: \"python_3_8\"/" config.pbtxt && \ +# Test conda env without custom Python backend stub This environment should +# always use the default Python version shipped in the container. 
For Ubuntu 22.04 +# it is Python 3.10 and for Ubuntu 20.04 is 3.8 +path_to_conda_pack='$$TRITON_MODEL_DIRECTORY/python_3_10_environment.tar.gz' +create_conda_env "3.10" "python-3-10" +conda install -c conda-forge libstdcxx-ng=12 -y +conda install numpy=1.23.4 -y +conda install tensorflow=2.10.0 -y +PY310_VERSION_STRING="Python version is 3.10, NumPy version is 1.23.4, and Tensorflow version is 2.10.0" +conda pack -o python3.10.tar.gz +mkdir -p models/python_3_10/1/ +cp ../../python_models/python_version/config.pbtxt ./models/python_3_10 +cp python3.10.tar.gz models/python_3_10/python_3_10_environment.tar.gz +(cd models/python_3_10 && \ + sed -i "s/^name:.*/name: \"python_3_10\"/" config.pbtxt && \ echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}" >> config.pbtxt) -cp ../../python_models/python_version/model.py ./models/python_3_8/1/ +cp ../../python_models/python_version/model.py ./models/python_3_10/1/ +conda deactivate rm -rf ./miniconda run_server @@ -107,31 +141,81 @@ kill $SERVER_PID wait $SERVER_PID set +e -grep "Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0" $SERVER_LOG -if [ $? -ne 0 ]; then - cat $SERVER_LOG - echo -e "\n***\n*** Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0 was not found in Triton logs. \n***" - RET=1 -fi +for EXPECTED_VERSION_STRING in "$PY36_VERSION_STRING" "$PY37_VERSION_STRING" "$PY37_1_VERSION_STRING" "$PY310_VERSION_STRING"; do + grep "$EXPECTED_VERSION_STRING" $SERVER_LOG + if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** $EXPECTED_VERSION_STRING was not found in Triton logs. \n***" + RET=1 + fi +done -grep "Python version is 3.7, NumPy version is 1.20.1, and Tensorflow version is 2.1.0" $SERVER_LOG -if [ $? -ne 0 ]; then +# Test default (non set) locale in python stub processes +# NOTE: In certain pybind versions, the locale settings may not be propagated from parent to +# stub processes correctly. See https://github.com/triton-inference-server/python_backend/pull/260. +export LC_ALL=INVALID +grep "Locale is (None, None)" $SERVER_LOG + if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Default unset Locale was not found in Triton logs. \n***" + RET=1 + fi +set -e + +rm $SERVER_LOG + +# Test locale set via environment variable in python stub processes +# NOTE: In certain pybind versions, the locale settings may not be propagated from parent to +# stub processes correctly. See https://github.com/triton-inference-server/python_backend/pull/260. +export LC_ALL=C.UTF-8 +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG - echo -e "\n***\n*** Python version is 3.7, NumPy version is 1.20.1, and Tensorflow version is 2.1.0 was not found in Triton logs. \n***" - RET=1 + exit 1 fi -grep "Python version is 3.8, NumPy version is 1.19.1, and Tensorflow version is 2.3.0" $SERVER_LOG -if [ $? -ne 0 ]; then +kill $SERVER_PID +wait $SERVER_PID + +set +e +grep "Locale is ('en_US', 'UTF-8')" $SERVER_LOG + if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Locale UTF-8 was not found in Triton logs. \n***" + RET=1 + fi +set -e + +rm $SERVER_LOG + +## Test re-extraction of environment. 
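+# The first load below extracts the environment; reloading after touching only
+# model.py should reuse the already extracted copy, while reloading after
+# touching the environment archive should extract it again, so the
+# "Extracting Python execution env" message is expected to appear exactly
+# twice in the server log.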
+SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 --model-control-mode=explicit" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG - echo -e "\n***\n*** Python version is 3.8, NumPy version is 1.19.1, and Tensorflow version is 2.3.0 was not found in Triton logs. \n***" - RET=1 + exit 1 fi -grep "no version information available (required by /bin/bash)." $SERVER_LOG -if [ $? -eq 0 ]; then +# The environment should be extracted +curl -v -X POST localhost:8000/v2/repository/models/python_3_10/load +touch -m models/python_3_10/1/model.py +# The environment should not be re-extracted +curl -v -X POST localhost:8000/v2/repository/models/python_3_10/load +touch -m models/python_3_10/python_3_10_environment.tar.gz +# The environment should be re-extracted +curl -v -X POST localhost:8000/v2/repository/models/python_3_10/load + +kill $SERVER_PID +wait $SERVER_PID + +set +e + +PY310_ENV_EXTRACTION="Extracting Python execution env" +if [ `grep -c "${PY310_ENV_EXTRACTION}" ${SERVER_LOG}` != "2" ]; then cat $SERVER_LOG - echo -e "\n***\n*** \"no version information available (required by /bin/bash).\" was found in the server logs. \n***" + echo -e "\n***\n*** Python execution environment should be extracted exactly twice. \n***" RET=1 fi set -e @@ -156,12 +240,15 @@ aws s3 mb "${BUCKET_URL}" BUCKET_URL=${BUCKET_URL%/} BUCKET_URL_SLASH="${BUCKET_URL}/" -# Model Python 3.7 contains absolute paths and because of this it cannot be used +# Remove Python 3.7 model because it contains absolute paths and cannot be used # with S3. rm -rf models/python_3_7 -rm $SERVER_LOG +# Test with the bucket url as model repository aws s3 cp models/ "${BUCKET_URL_SLASH}" --recursive --include "*" + +rm $SERVER_LOG + SERVER_ARGS="--model-repository=$BUCKET_URL_SLASH --log-verbose=1" run_server if [ "$SERVER_PID" == "0" ]; then @@ -174,14 +261,49 @@ kill $SERVER_PID wait $SERVER_PID set +e -grep "Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0" $SERVER_LOG +grep "$PY36_VERSION_STRING" $SERVER_LOG if [ $? -ne 0 ]; then cat $SERVER_LOG - echo -e "\n***\n*** Python version is 3.6, NumPy version is 1.18.1, and Tensorflow version is 2.1.0 was not found in Triton logs. \n***" + echo -e "\n***\n*** $PY36_VERSION_STRING was not found in Triton logs. \n***" RET=1 fi set -e +# Clean up bucket contents +aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" + +# Test with EXECUTION_ENV_PATH outside the model directory +sed -i "s/TRITON_MODEL_DIRECTORY\/python_3_6_environment/TRITON_MODEL_DIRECTORY\/..\/python_3_6_environment/" models/python_3_6/config.pbtxt +mv models/python_3_6/python_3_6_environment.tar.gz models +sed -i "s/\$\$TRITON_MODEL_DIRECTORY\/python_3_10_environment/s3:\/\/triton-bucket-${CI_JOB_ID}\/python_3_10_environment/" models/python_3_10/config.pbtxt +mv models/python_3_10/python_3_10_environment.tar.gz models + +aws s3 cp models/ "${BUCKET_URL_SLASH}" --recursive --include "*" + +rm $SERVER_LOG + +SERVER_ARGS="--model-repository=$BUCKET_URL_SLASH --log-verbose=1" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +set +e +for EXPECTED_VERSION_STRING in "$PY36_VERSION_STRING" "$PY310_VERSION_STRING"; do + grep "$EXPECTED_VERSION_STRING" $SERVER_LOG + if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** $EXPECTED_VERSION_STRING was not found in Triton logs. 
\n***" + RET=1 + fi +done +set -e + # Clean up bucket contents and delete bucket aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" aws s3 rb "${BUCKET_URL}" diff --git a/qa/L0_backend_python/examples/test.sh b/qa/L0_backend_python/examples/test.sh old mode 100644 new mode 100755 index bde23b3506..4f9cddab8d --- a/qa/L0_backend_python/examples/test.sh +++ b/qa/L0_backend_python/examples/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -32,22 +32,32 @@ TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/python_backend/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./examples_server.log" RET=0 rm -fr *.log python_backend/ -# # Skip torch install on Jetson since it is already installed. +# Install torch +pip3 uninstall -y torch if [ "$TEST_JETSON" == "0" ]; then - pip3 uninstall -y torch - pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html + pip3 install torch==2.0.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html torchvision==0.15.0+cu117 +else + pip3 install torch==2.0.0 -f https://download.pytorch.org/whl/torch_stable.html torchvision==0.15.0 +fi + +# Install `validators` for Model Instance Kind example +pip3 install validators + +# Install JAX +if [ "$TEST_JETSON" == "0" ]; then + pip3 install --upgrade "jax[cuda12_local]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html fi git clone https://github.com/triton-inference-server/python_backend -b $PYTHON_BACKEND_REPO_TAG cd python_backend # Example 1 -CLIENT_LOG="./add_sub_client.log" +CLIENT_LOG="./examples_add_sub_client.log" mkdir -p models/add_sub/1/ cp examples/add_sub/model.py models/add_sub/1/model.py cp examples/add_sub/config.pbtxt models/add_sub/config.pbtxt @@ -77,7 +87,7 @@ kill $SERVER_PID wait $SERVER_PID # Example 2 -CLIENT_LOG="./pytorch_client.log" +CLIENT_LOG="./examples_pytorch_client.log" mkdir -p models/pytorch/1/ cp examples/pytorch/model.py models/pytorch/1/model.py cp examples/pytorch/config.pbtxt models/pytorch/config.pbtxt @@ -108,8 +118,43 @@ wait $SERVER_PID # Example 3 +# JAX AddSub +# JAX is not supported on Jetson +if [ "$TEST_JETSON" == "0" ]; then + CLIENT_LOG="./examples_jax_client.log" + mkdir -p models/jax/1/ + cp examples/jax/model.py models/jax/1/model.py + cp examples/jax/config.pbtxt models/jax/config.pbtxt + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 + fi + + set +e + python3 examples/jax/client.py > $CLIENT_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify jax example. \n***" + RET=1 + fi + + grep "PASS" $CLIENT_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify jax example. 
\n***" + cat $CLIENT_LOG + RET=1 + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID +fi + +# Example 4 + # BLS Sync -CLIENT_LOG="./sync_client.log" +CLIENT_LOG="./examples_sync_client.log" mkdir -p models/bls_sync/1 cp examples/bls/sync_model.py models/bls_sync/1/model.py cp examples/bls/sync_config.pbtxt models/bls_sync/config.pbtxt @@ -138,10 +183,10 @@ set -e kill $SERVER_PID wait $SERVER_PID -# Example 4 +# Example 5 # Decoupled Repeat -CLIENT_LOG="./repeat_client.log" +CLIENT_LOG="./examples_repeat_client.log" mkdir -p models/repeat_int32/1/ cp examples/decoupled/repeat_model.py models/repeat_int32/1/model.py cp examples/decoupled/repeat_config.pbtxt models/repeat_int32/config.pbtxt @@ -170,10 +215,10 @@ set -e kill $SERVER_PID wait $SERVER_PID -# Example 5 +# Example 6 # Decoupled Square -CLIENT_LOG="./square_client.log" +CLIENT_LOG="./examples_square_client.log" mkdir -p models/square_int32/1/ cp examples/decoupled/square_model.py models/square_int32/1/model.py cp examples/decoupled/square_config.pbtxt models/square_int32/config.pbtxt @@ -209,7 +254,7 @@ wait $SERVER_PID # Having multiple python versions lead to build issues. # Anaconda is not officially supported on Jetson. if [ "$TEST_JETSON" == "0" ]; then - CLIENT_LOG="./async_client.log" + CLIENT_LOG="./examples_async_client.log" mkdir -p models/bls_async/1 cp examples/bls/async_model.py models/bls_async/1/model.py cp examples/bls/async_config.pbtxt models/bls_async/config.pbtxt @@ -241,17 +286,11 @@ if [ "$TEST_JETSON" == "0" ]; then fi # Auto Complete Model Configuration Example -CLIENT_LOG="./auto_complete_client.log" +CLIENT_LOG="./examples_auto_complete_client.log" mkdir -p models/nobatch_auto_complete/1/ mkdir -p models/batch_auto_complete/1/ cp examples/auto_complete/nobatch_model.py models/nobatch_auto_complete/1/model.py cp examples/auto_complete/batch_model.py models/batch_auto_complete/1/model.py -if [ "$TEST_JETSON" == "1" ]; then - echo -e 'name: "nobatch_auto_complete" \ninstance_group [{ kind: KIND_CPU }]' > \ - models/nobatch_auto_complete/config.pbtxt - echo -e 'name: "batch_auto_complete" \ninstance_group [{ kind: KIND_CPU }]' > \ - models/batch_auto_complete/config.pbtxt -fi SERVER_ARGS="$SERVER_ARGS --strict-model-config=false" @@ -280,6 +319,132 @@ set -e kill $SERVER_PID wait $SERVER_PID +# BLS Decoupled Sync +CLIENT_LOG="./examples_bls_decoupled_sync_client.log" +mkdir -p models/bls_decoupled_sync/1 +cp examples/bls_decoupled/sync_model.py models/bls_decoupled_sync/1/model.py +cp examples/bls_decoupled/sync_config.pbtxt models/bls_decoupled_sync/config.pbtxt +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 examples/bls_decoupled/sync_client.py > $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify BLS Decoupled Sync example. \n***" + RET=1 +fi + +grep "PASS" $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify BLS Decoupled Sync example. 
\n***" + cat $CLIENT_LOG + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# BLS Decoupled Async +if [ "$TEST_JETSON" == "0" ]; then + CLIENT_LOG="./examples_bls_decoupled_async_client.log" + mkdir -p models/bls_decoupled_async/1 + cp examples/bls_decoupled/async_model.py models/bls_decoupled_async/1/model.py + cp examples/bls_decoupled/async_config.pbtxt models/bls_decoupled_async/config.pbtxt + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 + fi + + set +e + python3 examples/bls_decoupled/async_client.py > $CLIENT_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify BLS Decoupled Async example. \n***" + RET=1 + fi + + grep "PASS" $CLIENT_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify BLS Decoupled Async example. \n***" + cat $CLIENT_LOG + RET=1 + fi + + set -e + + kill $SERVER_PID + wait $SERVER_PID +fi + +# Example 7 + +# Model Instance Kind +CLIENT_LOG="./examples_model_instance_kind.log" +mkdir -p models/resnet50/1 +cp examples/instance_kind/model.py models/resnet50/1/ +cp examples/instance_kind/config.pbtxt models/resnet50/ +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 examples/instance_kind/client.py --label_file examples/instance_kind/resnet50_labels.txt > $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify Model instance Kind example. \n***" + RET=1 +fi + +grep "PASS" $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify Model Instance Kind example. Example failed to pass. \n***" + cat $CLIENT_LOG + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Custom Metrics +CLIENT_LOG="./examples_custom_metrics_client.log" +mkdir -p models/custom_metrics/1 +cp examples/custom_metrics/model.py models/custom_metrics/1/model.py +cp examples/custom_metrics/config.pbtxt models/custom_metrics/config.pbtxt +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 examples/custom_metrics/client.py > $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify Custom Metrics example. \n***" + RET=1 +fi + +grep "PASS" $CLIENT_LOG +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed to verify Custom Metrics example. \n***" + cat $CLIENT_LOG + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Example verification test PASSED.\n***" else diff --git a/qa/L0_backend_python/io/io_test.py b/qa/L0_backend_python/io/io_test.py old mode 100644 new mode 100755 index 8a88837478..ff67e8c0ff --- a/qa/L0_backend_python/io/io_test.py +++ b/qa/L0_backend_python/io/io_test.py @@ -1,4 +1,6 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,22 +30,21 @@ sys.path.append("../../common") +import os +import queue +import unittest from functools import partial -import test_util as tu + +import numpy as np import shm_util -import tritonclient.http as httpclient +import test_util as tu import tritonclient.grpc as grpcclient from tritonclient.utils import * -import numpy as np -import unittest -import queue -import os -TRIAL = os.getenv('TRIAL') +TRIAL = os.getenv("TRIAL") class UserData: - def __init__(self): self._completed_requests = queue.Queue() @@ -56,55 +57,102 @@ def callback(user_data, result, error): class IOTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() + self._client = grpcclient.InferenceServerClient("localhost:8001") - def _run_test(self): + def _run_ensemble_test(self): model_name = "ensemble_io" user_data = UserData() - with grpcclient.InferenceServerClient("localhost:8001") as client: - input0 = np.random.random([1000]).astype(np.float32) - client.start_stream(callback=partial(callback, user_data)) - for model_1_in_gpu in [True, False]: - for model_2_in_gpu in [True, False]: - for model_3_in_gpu in [True, False]: - gpu_output = np.asarray( - [model_1_in_gpu, model_2_in_gpu, model_3_in_gpu], - dtype=bool) - inputs = [ - grpcclient.InferInput( - "INPUT0", input0.shape, - np_to_triton_dtype(input0.dtype)), - grpcclient.InferInput( - "GPU_OUTPUT", gpu_output.shape, - np_to_triton_dtype(gpu_output.dtype)) - ] - inputs[0].set_data_from_numpy(input0) - inputs[1].set_data_from_numpy(gpu_output) - client.async_stream_infer(model_name=model_name, - inputs=inputs) - if TRIAL == 'default': + input0 = np.random.random([1000]).astype(np.float32) + self._client.start_stream(callback=partial(callback, user_data)) + for model_1_in_gpu in [True, False]: + for model_2_in_gpu in [True, False]: + for model_3_in_gpu in [True, False]: + gpu_output = np.asarray( + [model_1_in_gpu, model_2_in_gpu, model_3_in_gpu], dtype=bool + ) + inputs = [ + grpcclient.InferInput( + "INPUT0", input0.shape, np_to_triton_dtype(input0.dtype) + ), + grpcclient.InferInput( + "GPU_OUTPUT", + gpu_output.shape, + np_to_triton_dtype(gpu_output.dtype), + ), + ] + inputs[0].set_data_from_numpy(input0) + inputs[1].set_data_from_numpy(gpu_output) + self._client.async_stream_infer( + model_name=model_name, inputs=inputs + ) + if TRIAL == "default": + result = user_data._completed_requests.get() + output0 = result.as_numpy("OUTPUT0") + self.assertIsNotNone(output0) + self.assertTrue(np.all(output0 == input0)) + else: + response_repeat = 2 + for _ in range(response_repeat): result = user_data._completed_requests.get() - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0) self.assertTrue(np.all(output0 == input0)) - else: - response_repeat = 2 - for _ in range(response_repeat): - result = user_data._completed_requests.get() - output0 = result.as_numpy('OUTPUT0') - self.assertIsNotNone(output0) - self.assertTrue(np.all(output0 == input0)) def test_ensemble_io(self): # Only run the shared memory leak detection with the default trial - if TRIAL == 'default': + if TRIAL == "default": with self._shm_leak_detector.Probe(): - self._run_test() + self._run_ensemble_test() else: - self._run_test() - + self._run_ensemble_test() + + def test_empty_gpu_output(self): + model_name = "dlpack_empty_output" + input_data = np.array([[1.0]], 
dtype=np.float32) + inputs = [ + grpcclient.InferInput( + "INPUT", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + result = self._client.infer(model_name, inputs) + output = result.as_numpy("OUTPUT") + self.assertIsNotNone(output) + self.assertEqual(output.size, 0) + + def test_variable_gpu_output(self): + # Input is not important in this test + model_name = "variable_gpu_output" + input_data = np.array([[1.0]], dtype=np.float32) + inputs = [ + grpcclient.InferInput( + "INPUT", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + user_data = UserData() -if __name__ == '__main__': + # The test sends five requests to the model and the model returns five + # responses with different GPU output shapes + num_requests = 5 + for _ in range(num_requests): + _ = self._client.async_infer( + model_name=model_name, + inputs=inputs, + callback=partial(callback, user_data), + ) + + for i in range(num_requests): + result = user_data._completed_requests.get() + if result is InferenceServerException: + self.assertTrue(False, result) + output = result.as_numpy("OUTPUT") + self.assertIsNotNone(output) + self.assertEqual(output.size, i + 1) + np.testing.assert_almost_equal(output, np.ones(i + 1) * (i + 1)) + + +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/io/test.sh b/qa/L0_backend_python/io/test.sh old mode 100644 new mode 100755 index eb642e6e4b..86827a4260 --- a/qa/L0_backend_python/io/test.sh +++ b/qa/L0_backend_python/io/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,7 +26,7 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. UNITTEST_PY=./io_test.py -CLIENT_LOG="./client.log" +CLIENT_LOG="./io_client.log" EXPECTED_NUM_TESTS="1" TEST_RESULT_FILE='test_results.txt' source ../common.sh @@ -37,14 +37,15 @@ SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./io_server.log" RET=0 rm -fr *.log ./models pip3 uninstall -y torch -pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html +pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html +# IOTest.test_ensemble_io TRIALS="default decoupled" for trial in $TRIALS; do @@ -82,26 +83,89 @@ for trial in $TRIALS; do fi set +e - python3 $UNITTEST_PY > $CLIENT_LOG + python3 $UNITTEST_PY IOTest.test_ensemble_io > $CLIENT_LOG.test_ensemble_io if [ $? -ne 0 ]; then - echo -e "\n***\n*** io_test.py FAILED. \n***" - cat $CLIENT_LOG + echo -e "\n***\n*** IOTest.test_ensemble_io FAILED. \n***" + cat $CLIENT_LOG.test_ensemble_io RET=1 else check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS if [ $? 
-ne 0 ]; then - cat $CLIENT_LOG + cat $CLIENT_LOG.test_ensemble_io echo -e "\n***\n*** Test Result Verification Failed\n***" RET=1 fi fi - set -e kill $SERVER_PID wait $SERVER_PID done +# IOTest.test_empty_gpu_output +rm -rf models && mkdir models +mkdir -p models/dlpack_empty_output/1/ +cp ../../python_models/dlpack_empty_output/model.py ./models/dlpack_empty_output/1/ +cp ../../python_models/dlpack_empty_output/config.pbtxt ./models/dlpack_empty_output/ + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 $UNITTEST_PY IOTest.test_empty_gpu_output > $CLIENT_LOG.test_empty_gpu_output +if [ $? -ne 0 ]; then + echo -e "\n***\n*** IOTest.test_empty_gpu_output FAILED. \n***" + cat $CLIENT_LOG.test_empty_gpu_output + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG.test_empty_gpu_output + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# IOTest.test_variable_gpu_output +rm -rf models && mkdir models +mkdir -p models/variable_gpu_output/1/ +cp ../../python_models/variable_gpu_output/model.py ./models/variable_gpu_output/1/ +cp ../../python_models/variable_gpu_output/config.pbtxt ./models/variable_gpu_output/ + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + RET=1 +fi + +set +e +python3 $UNITTEST_PY IOTest.test_variable_gpu_output > $CLIENT_LOG.test_variable_gpu_output +if [ $? -ne 0 ]; then + echo -e "\n***\n*** IOTest.variable_gpu_output FAILED. \n***" + cat $CLIENT_LOG.test_variable_gpu_output + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG.test_variable_gpu_output + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + if [ $RET -eq 0 ]; then echo -e "\n***\n*** IO test PASSED.\n***" else diff --git a/qa/L0_backend_python/lifecycle/lifecycle_test.py b/qa/L0_backend_python/lifecycle/lifecycle_test.py old mode 100644 new mode 100755 index f9805d7984..82856bbd32 --- a/qa/L0_backend_python/lifecycle/lifecycle_test.py +++ b/qa/L0_backend_python/lifecycle/lifecycle_test.py @@ -1,4 +1,6 @@ -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,21 +27,23 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../../common") -import test_util as tu -import shm_util +import queue +import time +import unittest from functools import partial -import tritonclient.http as httpclient + +import numpy as np +import shm_util +import test_util as tu import tritonclient.grpc as grpcclient +import tritonclient.http as httpclient from tritonclient.utils import * -import numpy as np -import unittest -import queue class UserData: - def __init__(self): self._completed_requests = queue.Queue() @@ -52,17 +56,87 @@ def callback(user_data, result, error): class LifecycleTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() + def test_error_code(self): + model_name = "error_code" + shape = [1, 1] + # [(Triton error, expected gRPC error message starting), ...] + errors = [ + ("UNKNOWN", "[StatusCode.UNKNOWN]"), + ("INTERNAL", "[StatusCode.INTERNAL]"), + ("NOT_FOUND", "[StatusCode.NOT_FOUND]"), + ("INVALID_ARG", "[StatusCode.INVALID_ARGUMENT]"), + ("UNAVAILABLE", "[StatusCode.UNAVAILABLE]"), + ("UNSUPPORTED", "[StatusCode.UNIMPLEMENTED]"), + ("ALREADY_EXISTS", "[StatusCode.ALREADY_EXISTS]"), + ("CANCELLED", "[StatusCode.CANCELLED]"), + ("(default)", "[StatusCode.INTERNAL] unrecognized"), + ] + with self._shm_leak_detector.Probe() as shm_probe: + with grpcclient.InferenceServerClient("localhost:8001") as client: + for error, expected_grpc_error_start in errors: + input_data = np.array([[error]], dtype=np.object_) + inputs = [ + grpcclient.InferInput( + "ERROR_CODE", shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + with self.assertRaises(InferenceServerException) as e: + client.infer(model_name, inputs) + # e.g. [StatusCode.UNKNOWN] error code: TRITONSERVER_ERROR_UNKNOWN + # e.g. [StatusCode.INTERNAL] unrecognized error code: (default) + self.assertEqual( + str(e.exception), + expected_grpc_error_start + " error code: " + error, + ) + + def test_execute_cancel(self): + model_name = "execute_cancel" + log_path = "lifecycle_server.log" + execute_delay = 4.0 # seconds + shape = [1, 1] + response = {"responded": False, "result": None, "error": None} + + def callback(result, error): + response["responded"] = True + response["result"] = result + response["error"] = error + + with self._shm_leak_detector.Probe() as shm_probe: + with grpcclient.InferenceServerClient("localhost:8001") as client: + input_data = np.array([[execute_delay]], dtype=np.float32) + inputs = [ + grpcclient.InferInput( + "EXECUTE_DELAY", shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + exec_future = client.async_infer(model_name, inputs, callback) + time.sleep(2) # ensure the request is executing + self.assertFalse(response["responded"]) + exec_future.cancel() + time.sleep(2) # ensure the cancellation is delivered + self.assertTrue(response["responded"]) + + self.assertEqual(response["result"], None) + self.assertIsInstance(response["error"], InferenceServerException) + self.assertEqual(response["error"].status(), "StatusCode.CANCELLED") + with open(log_path, mode="r", encoding="utf-8", errors="strict") as f: + log_text = f.read() + self.assertIn("[execute_cancel] Request not cancelled at 1.0 s", log_text) + self.assertIn("[execute_cancel] Request cancelled at ", log_text) + def test_batch_error(self): - # The execute_error model returns an error for the first request and - # sucessfully processes the second request. 
This is making sure that - # an error in a single request does not completely fail the batch. + # The execute_error model returns an error for the first and third + # request and successfully processes the second request. This is making + # sure that an error in a single request does not completely fail the + # batch. model_name = "execute_error" shape = [2, 2] - number_of_requests = 2 + number_of_requests = 3 user_data = UserData() triton_client = grpcclient.InferenceServerClient("localhost:8001") triton_client.start_stream(callback=partial(callback, user_data)) @@ -73,16 +147,16 @@ def test_batch_error(self): input_data = np.random.randn(*shape).astype(np.float32) input_datas.append(input_data) inputs = [ - grpcclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + grpcclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) - triton_client.async_stream_infer(model_name=model_name, - inputs=inputs) + triton_client.async_stream_infer(model_name=model_name, inputs=inputs) for i in range(number_of_requests): result = user_data._completed_requests.get() - if i == 0: + if i == 0 or i == 2: self.assertIs(type(result), InferenceServerException) continue @@ -92,7 +166,9 @@ def test_batch_error(self): self.assertTrue( np.array_equal(output_data, input_datas[i]), "error: expected output {} to match input {}".format( - output_data, input_datas[i])) + output_data, input_datas[i] + ), + ) def test_infer_pymodel_error(self): model_name = "wrong_model" @@ -102,8 +178,9 @@ def test_infer_pymodel_error(self): with httpclient.InferenceServerClient("localhost:8000") as client: input_data = (16384 * np.random.randn(*shape)).astype(np.uint32) inputs = [ - httpclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) try: @@ -113,21 +190,24 @@ def test_infer_pymodel_error(self): self.assertTrue( e.message().startswith( "Failed to process the request(s) for model instance" - ), "Exception message is not correct") + ), + "Exception message is not correct", + ) else: self.assertTrue( - False, - "Wrong exception raised or did not raise an exception") + False, "Wrong exception raised or did not raise an exception" + ) def test_incorrect_execute_return(self): - model_name = 'execute_return_error' + model_name = "execute_return_error" shape = [1, 1] with self._shm_leak_detector.Probe() as shm_probe: with httpclient.InferenceServerClient("localhost:8000") as client: input_data = (5 * np.random.randn(*shape)).astype(np.float32) inputs = [ - httpclient.InferInput("INPUT", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "INPUT", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) @@ -136,10 +216,11 @@ def test_incorrect_execute_return(self): client.infer(model_name, inputs) self.assertTrue( - str(e.exception).startswith( - "Failed to process the request(s) for model instance " - "'execute_return_error_0', message: Expected a list in the " - "execute return"), "Exception message is not correct.") + "Failed to process the request(s) for model instance " + "'execute_return_error_0_0', message: Expected a list in the " + "execute return" in str(e.exception), + "Exception message is not correct.", + ) # The second inference request will return a list of None object # instead of 
Python InferenceResponse objects. @@ -147,12 +228,13 @@ def test_incorrect_execute_return(self): client.infer(model_name, inputs) self.assertTrue( - str(e.exception).startswith( - "Failed to process the request(s) for model instance " - "'execute_return_error_0', message: Expected an " - "'InferenceResponse' object in the execute function return" - " list"), "Exception message is not correct.") + "Failed to process the request(s) for model instance " + "'execute_return_error_0_0', message: Expected an " + "'InferenceResponse' object in the execute function return" + " list" in str(e.exception), + "Exception message is not correct.", + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/lifecycle/test.sh b/qa/L0_backend_python/lifecycle/test.sh old mode 100644 new mode 100755 index 9d7917b538..3d843ea874 --- a/qa/L0_backend_python/lifecycle/test.sh +++ b/qa/L0_backend_python/lifecycle/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,8 +25,8 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_LOG="./client.log" -EXPECTED_NUM_TESTS="3" +CLIENT_LOG="./lifecycle_client.log" +EXPECTED_NUM_TESTS="5" TEST_RESULT_FILE='test_results.txt' source ../common.sh source ../../common/util.sh @@ -35,11 +35,19 @@ TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./lifecycle_server.log" RET=0 rm -fr *.log ./models +mkdir -p models/error_code/1/ +cp ../../python_models/error_code/model.py ./models/error_code/1/ +cp ../../python_models/error_code/config.pbtxt ./models/error_code/ + +mkdir -p models/execute_cancel/1/ +cp ../../python_models/execute_cancel/model.py ./models/execute_cancel/1/ +cp ../../python_models/execute_cancel/config.pbtxt ./models/execute_cancel/ + mkdir -p models/execute_error/1/ cp ../../python_models/execute_error/model.py ./models/execute_error/1/ cp ../../python_models/execute_error/config.pbtxt ./models/execute_error/ @@ -72,7 +80,7 @@ set +e # Run this multiple times to catch any intermittent segfault. for i in {0..4}; do - python3 lifecycle_test.py > $CLIENT_LOG 2>&1 + python3 lifecycle_test.py > $CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** lifecycle_test.py FAILED. \n***" @@ -171,10 +179,6 @@ set -e rm -rf models/ mkdir -p models/auto_complete_error/1/ cp ../../python_models/auto_complete_error/model.py ./models/auto_complete_error/1/ -if [ "$TEST_JETSON" == "1" ]; then - echo -e 'name: "auto_complete_error" \ninstance_group [{ kind: KIND_CPU }]' > \ - models/auto_complete_error/config.pbtxt -fi SERVER_ARGS="${SERVER_ARGS} --strict-model-config=false" diff --git a/qa/L0_backend_python/logging/logging_test.py b/qa/L0_backend_python/logging/logging_test.py new file mode 100755 index 0000000000..b21919df65 --- /dev/null +++ b/qa/L0_backend_python/logging/logging_test.py @@ -0,0 +1,58 @@ +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../../common") +import unittest + +import numpy as np +import test_util as tu +import tritonclient.http as httpclient +from tritonclient.utils import * + + +class LogTest(tu.TestResultCollector): + def test_log_output(self): + model_name = "identity_fp32_logging" + with httpclient.InferenceServerClient("localhost:8000") as client: + input_data = np.array([[1.0]], dtype=np.float32) + inputs = [ + httpclient.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + result = client.infer(model_name, inputs) + output0 = result.as_numpy("OUTPUT0") + self.assertIsNotNone(output0) + self.assertTrue(np.all(output0 == input_data)) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_backend_python/logging/test.sh b/qa/L0_backend_python/logging/test.sh new file mode 100755 index 0000000000..b665ead7dd --- /dev/null +++ b/qa/L0_backend_python/logging/test.sh @@ -0,0 +1,231 @@ +#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +CLIENT_LOG="logging_client.log" +TEST_RESULT_FILE="test_results.txt" +LOG_TEST="logging_test.py" +SERVER_LOG="./logging_server.log" + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +export CUDA_VISIBLE_DEVICES=0 + +# On windows the paths invoked by the script (running in WSL) must use +# /mnt/c when needed but the paths on the tritonserver command-line +# must be C:/ style. +if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then + MODELDIR=${MODELDIR:=C:/models} + DATADIR=${DATADIR:="/mnt/c/data/inferenceserver/${REPO_VERSION}"} + BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends} + SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe} + export WSLENV=$WSLENV:TRITONSERVER_DELAY_SCHEDULER +else + MODELDIR=${MODELDIR:=`pwd`} + DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} + TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} + SERVER=${TRITON_DIR}/bin/tritonserver + BACKEND_DIR=${TRITON_DIR}/backends +fi + +MODELSDIR=`pwd`/models +source ../../common/util.sh + +function verify_log_counts () { + non_verbose_expected=$1 + verbose_expected=$2 + + if [ `grep -c "Specific Msg!" $SERVER_LOG` != $non_verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Specific Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Info Msg!" $SERVER_LOG` != $non_verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Info Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Warning Msg!" $SERVER_LOG` != $non_verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Warning Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Error Msg!" $SERVER_LOG` != $non_verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Error Msg Count Incorrect\n***" + RET=1 + fi + if [ `grep -c "Verbose Msg!" $SERVER_LOG` != $verbose_expected ]; then + echo -e "\n***\n*** Test Failed: Verbose Msg Count Incorrect\n***" + RET=1 + fi +} + +rm -f *.log + +# set up simple repository MODELBASE +rm -fr $MODELSDIR && mkdir -p $MODELSDIR && \ + python_model="identity_fp32_logging" + mkdir -p models/$python_model/1/ + cp ../../python_models/$python_model/config.pbtxt models/$python_model/config.pbtxt + cp ../../python_models/$python_model/model.py models/$python_model/1/ +RET=0 + +#Run Server with Default Log Settings +SERVER_ARGS="--model-repository=$MODELSDIR --backend-directory=${BACKEND_DIR}" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python3 $LOG_TEST >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? 
-ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Check that the correct number of log messages are present [ non-verbose-msg-cnt | verbose-msg-cnt ] +verify_log_counts 4 0 + +rm -f *.log +#Run Server Enabling Verbose Messages +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +# Enable verbose logging +code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":1}' localhost:8000/v2/logging` + +if [ "$code" != "200" ]; then + cat ./curl.out + echo -e "\n***\n*** Test Failed: Could not Change Log Settings\n***" + RET=1 +fi + +python3 $LOG_TEST >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Verbose count is only 3 because the model must initialize before +# log settings can be modified +verify_log_counts 4 3 + +rm -f *.log +#Run Server and Disable Info, Warning, and Error Log Messages +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +# Disable all non-verbose logging +BOOL_PARAMS=${BOOL_PARAMS:="log_info log_warning log_error"} +for BOOL_PARAM in $BOOL_PARAMS; do + # Set each boolean log setting to false via the logging endpoint + code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":false}' localhost:8000/v2/logging` + if [ "$code" != "200" ]; then + cat ./curl.out + echo -e "\n***\n*** Test Failed: Could not Change Log Settings\n***" + RET=1 + fi +done + +python3 $LOG_TEST >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Will have 1 occurrence of each non-verbose log type +# because the server must initialize before log settings +# can be modified +verify_log_counts 1 0 + + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Logging test PASSED. \n***" +else + echo -e "\n***\n*** Logging test FAILED. \n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/model_control/model_control_test.py b/qa/L0_backend_python/model_control/model_control_test.py old mode 100644 new mode 100755 index feceda01e4..17686f97d5 --- a/qa/L0_backend_python/model_control/model_control_test.py +++ b/qa/L0_backend_python/model_control/model_control_test.py @@ -1,4 +1,6 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,22 +30,22 @@ sys.path.append("../../common") +import unittest + +import numpy as np +import shm_util import test_util as tu import tritonclient.http as httpclient from tritonclient.utils import * -import numpy as np -import unittest -import shm_util class ExplicitModelTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() def send_identity_request(self, client, model_name): inputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "FP32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "FP32")) input0_data = np.arange(start=0, stop=16, dtype=np.float32) input0_data = np.expand_dims(input0_data, axis=0) inputs[0].set_data_from_numpy(input0_data) @@ -52,13 +54,14 @@ def send_identity_request(self, client, model_name): result = client.infer( model_name=model_name, inputs=inputs, - outputs=[httpclient.InferRequestedOutput('OUTPUT0')]) - output_numpy = result.as_numpy('OUTPUT0') + outputs=[httpclient.InferRequestedOutput("OUTPUT0")], + ) + output_numpy = result.as_numpy("OUTPUT0") self.assertTrue(np.all(input0_data == output_numpy)) def test_model_reload(self): model_name = "identity_fp32" - ensemble_model_name = 'simple_' + "identity_fp32" + ensemble_model_name = "simple_" + "identity_fp32" with httpclient.InferenceServerClient("localhost:8000") as client: for _ in range(5): self.assertFalse(client.is_model_ready(model_name)) @@ -76,5 +79,5 @@ def test_model_reload(self): self.assertFalse(client.is_model_ready(ensemble_model_name)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/model_control/test.sh b/qa/L0_backend_python/model_control/test.sh old mode 100644 new mode 100755 index 63fabd8bd2..c4709ce217 --- a/qa/L0_backend_python/model_control/test.sh +++ b/qa/L0_backend_python/model_control/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,14 +25,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_LOG="./client.log" +CLIENT_LOG="./model_control_client.log" EXPECTED_NUM_TESTS="1" TEST_RESULT_FILE='test_results.txt' TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./model_control_server.log" RET=0 rm -fr *.log ./models @@ -77,3 +77,5 @@ if [ $RET -eq 1 ]; then else echo -e "\n***\n*** model_control_test PASSED. \n***" fi + +exit $RET diff --git a/qa/L0_backend_python/python_based_backends/python_based_backends_test.py b/qa/L0_backend_python/python_based_backends/python_based_backends_test.py new file mode 100644 index 0000000000..13fe204267 --- /dev/null +++ b/qa/L0_backend_python/python_based_backends/python_based_backends_test.py @@ -0,0 +1,144 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys +import unittest +from random import randint + +import numpy as np +import tritonclient.grpc as grpcclient +from tritonclient.utils import * + +sys.path.append("../../common") +from test_util import TestResultCollector + + +class PythonBasedBackendsTest(TestResultCollector): + def setUp(self): + self.triton_client = grpcclient.InferenceServerClient(url="localhost:8001") + self.add_sub_model_1 = "add" + self.add_sub_model_2 = "sub" + self.python_model = "add_sub" + self.pytorch_model = "add_sub_pytorch" + + self.triton_client.load_model( + self.add_sub_model_1, + config='{"backend":"add_sub","version_policy":{"latest":{"num_versions":2}}}', + ) + self.triton_client.load_model(self.add_sub_model_2) + self.triton_client.load_model(self.python_model) + self.triton_client.load_model(self.pytorch_model) + + def test_add_sub_models(self): + self.assertTrue( + self.triton_client.is_model_ready(self.add_sub_model_1, model_version="2") + ) + self._test_add_sub_model( + model_name=self.add_sub_model_1, model_version="2", single_output=True + ) + + self.assertTrue( + self.triton_client.is_model_ready(self.add_sub_model_1, model_version="1") + ) + self._test_add_sub_model( + model_name=self.add_sub_model_1, model_version="1", single_output=True + ) + + self.assertTrue(self.triton_client.is_model_ready(self.add_sub_model_2)) + self._test_add_sub_model(model_name=self.add_sub_model_2, single_output=True) + + def test_python_model(self): + self.assertTrue( + self.triton_client.is_model_ready(self.python_model, model_version="2") + ) + self._test_add_sub_model( + model_name=self.python_model, shape=[16], model_version="2" + ) + + def test_pytorch_model(self): + self.assertTrue( + self.triton_client.is_model_ready(self.pytorch_model, model_version="1") + ) + self._test_add_sub_model(model_name=self.pytorch_model) + + def _test_add_sub_model( + self, model_name, model_version="1", shape=[4], single_output=False + ): + input0_data = np.random.rand(*shape).astype(np.float32) + input1_data = np.random.rand(*shape).astype(np.float32) + + inputs = [ + 
grpcclient.InferInput( + "INPUT0", input0_data.shape, np_to_triton_dtype(input0_data.dtype) + ), + grpcclient.InferInput( + "INPUT1", input1_data.shape, np_to_triton_dtype(input1_data.dtype) + ), + ] + + inputs[0].set_data_from_numpy(input0_data) + inputs[1].set_data_from_numpy(input1_data) + + if single_output: + outputs = [grpcclient.InferRequestedOutput("OUTPUT")] + + else: + outputs = [ + grpcclient.InferRequestedOutput("OUTPUT0"), + grpcclient.InferRequestedOutput("OUTPUT1"), + ] + + response = self.triton_client.infer( + model_name=model_name, + inputs=inputs, + model_version=model_version, + request_id=str(randint(10, 99)), + outputs=outputs, + ) + + if single_output: + if model_name == "add": + self.assertTrue( + np.allclose(input0_data + input1_data, response.as_numpy("OUTPUT")) + ) + else: + self.assertTrue( + np.allclose(input0_data - input1_data, response.as_numpy("OUTPUT")) + ) + else: + self.assertTrue( + np.allclose(input0_data + input1_data, response.as_numpy("OUTPUT0")) + ) + self.assertTrue( + np.allclose(input0_data - input1_data, response.as_numpy("OUTPUT1")) + ) + + def tearDown(self): + self.triton_client.close() + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_backend_python/python_based_backends/test.sh b/qa/L0_backend_python/python_based_backends/test.sh new file mode 100755 index 0000000000..0f332eb3e0 --- /dev/null +++ b/qa/L0_backend_python/python_based_backends/test.sh @@ -0,0 +1,113 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +source ../../common/util.sh + +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends +QA_MODELS_PATH="../../python_models" +MODEL_REPOSITORY="$(pwd)/models" +SERVER_ARGS="--model-repository=${MODEL_REPOSITORY} --backend-directory=${BACKEND_DIR} --model-control-mode=explicit --log-verbose=1" +SERVER_LOG="./python_based_backends_server.log" +CLIENT_LOG="./python_based_backends_client.log" +TEST_RESULT_FILE="./test_results.txt" +CLIENT_PY="./python_based_backends_test.py" +GEN_PYTORCH_MODEL_PY="../../common/gen_qa_pytorch_model.py" +EXPECTED_NUM_TESTS=3 +RET=0 + +rm -rf ${MODEL_REPOSITORY} +pip3 install torch + +# Setup add_sub backend and models +mkdir -p ${BACKEND_DIR}/add_sub +cp ${QA_MODELS_PATH}/python_based_backends/add_sub_backend/model.py ${BACKEND_DIR}/add_sub/model.py + +mkdir -p ${MODEL_REPOSITORY}/add/1/ +echo '{ "operation": "add" }' > ${MODEL_REPOSITORY}/add/1/model.json +echo "backend: \"add_sub\"" > ${MODEL_REPOSITORY}/add/config.pbtxt +cp -r ${MODEL_REPOSITORY}/add/1/ ${MODEL_REPOSITORY}/add/2/ + +mkdir -p ${MODEL_REPOSITORY}/sub/1/ +echo '{ "operation": "sub" }' > ${MODEL_REPOSITORY}/sub/1/model.json +echo "backend: \"add_sub\"" > ${MODEL_REPOSITORY}/sub/config.pbtxt + +# Setup python backend model +mkdir -p ${MODEL_REPOSITORY}/add_sub/1 +cp ${QA_MODELS_PATH}/add_sub/model.py ${MODEL_REPOSITORY}/add_sub/1/ +cp ${QA_MODELS_PATH}/add_sub/config.pbtxt ${MODEL_REPOSITORY}/add_sub/ +cp -r ${MODEL_REPOSITORY}/add_sub/1/ ${MODEL_REPOSITORY}/add_sub/2/ + +# Setup pytorch backend model +cp ${GEN_PYTORCH_MODEL_PY} ./gen_qa_pytorch_model.py +GEN_PYTORCH_MODEL_PY=./gen_qa_pytorch_model.py + +set +e +python3 ${GEN_PYTORCH_MODEL_PY} -m ${MODEL_REPOSITORY} + +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Running ${GEN_PYTORCH_MODEL_PY} FAILED. \n***" + exit 1 +fi +set -e + +run_server +if [ "$SERVER_PID" == "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" + exit 1 +fi + +set +e +python3 $CLIENT_PY -v >$CLIENT_LOG 2>&1 + +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Running $CLIENT_PY FAILED. \n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test Result Verification FAILED.\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID +rm -rf ${MODEL_REPOSITORY} ${GEN_PYTORCH_MODEL_PY} + +if [ $RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG + echo -e "\n***\n*** Python-based Backends test FAILED. \n***" +else + echo -e "\n***\n*** Python-based Backends test PASSED. \n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/python_test.py b/qa/L0_backend_python/python_test.py old mode 100644 new mode 100755 index 3c5d520775..eb4d02aa53 --- a/qa/L0_backend_python/python_test.py +++ b/qa/L0_backend_python/python_test.py @@ -1,6 +1,6 @@ #!/usr/bin/python -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -30,21 +30,20 @@ sys.path.append("../common") +import os import unittest + import numpy as np -import test_util as tu -import shm_util import requests as httpreq -import os - -from tritonclient.utils import * +import shm_util +import test_util as tu import tritonclient.http as httpclient +from tritonclient.utils import * -TEST_JETSON = bool(int(os.environ.get('TEST_JETSON', 0))) +TEST_JETSON = bool(int(os.environ.get("TEST_JETSON", 0))) class PythonTest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() @@ -52,33 +51,39 @@ def _infer_help(self, model_name, shape, data_type): with httpclient.InferenceServerClient("localhost:8000") as client: input_data_0 = np.array(np.random.randn(*shape), dtype=data_type) inputs = [ - httpclient.InferInput("INPUT0", shape, - np_to_triton_dtype(input_data_0.dtype)) + httpclient.InferInput( + "INPUT0", shape, np_to_triton_dtype(input_data_0.dtype) + ) ] inputs[0].set_data_from_numpy(input_data_0) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertTrue(np.all(input_data_0 == output0)) + def _create_cuda_region(self, client, size, name): + import tritonclient.utils.cuda_shared_memory as cuda_shared_memory + + shm0_handle = cuda_shared_memory.create_shared_memory_region( + name, byte_size=size, device_id=0 + ) + client.register_cuda_shared_memory( + name, cuda_shared_memory.get_raw_handle(shm0_handle), 0, size + ) + return shm0_handle + def _optional_input_infer(self, model_name, has_input0, has_input1): with httpclient.InferenceServerClient("localhost:8000") as client: shape = (1,) if has_input0: - input0_numpy = np.random.randint(0, - 100, - size=shape, - dtype=np.int32) + input0_numpy = np.random.randint(0, 100, size=shape, dtype=np.int32) else: # Set the input0 to a default value if it is optional. This is # the input used by the model if it is not provided. input0_numpy = np.array([5], dtype=np.int32) if has_input1: - input1_numpy = np.random.randint(0, - 100, - size=shape, - dtype=np.int32) + input1_numpy = np.random.randint(0, 100, size=shape, dtype=np.int32) else: # Set the input1 to a default value if it is optional. This is # the input used by the model if it is not provided. 
@@ -88,68 +93,136 @@ def _optional_input_infer(self, model_name, has_input0, has_input1): if has_input0: inputs.append( httpclient.InferInput( - "INPUT0", shape, - np_to_triton_dtype(input0_numpy.dtype))) + "INPUT0", shape, np_to_triton_dtype(input0_numpy.dtype) + ) + ) inputs[-1].set_data_from_numpy(input0_numpy) if has_input1: inputs.append( httpclient.InferInput( - "INPUT1", shape, - np_to_triton_dtype(input1_numpy.dtype))) + "INPUT1", shape, np_to_triton_dtype(input1_numpy.dtype) + ) + ) inputs[-1].set_data_from_numpy(input1_numpy) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0, "OUTPUT0 was not found.") - output1 = result.as_numpy('OUTPUT1') + output1 = result.as_numpy("OUTPUT1") self.assertIsNotNone(output1, "OUTPUT1 was not found.") expected_output0 = input0_numpy + input1_numpy expected_output1 = input0_numpy - input1_numpy - np.testing.assert_equal(output0, expected_output0, - "OUTPUT0 doesn't match expected OUTPUT0") - np.testing.assert_equal(output1, expected_output1, - "OUTPUT1 doesn't match expected OUTPUT1") - - # We do not use a docker on Jetson so it does not impose a shared memory - # allocation limit of 1GB. This means test will pass without the expected - # error on jetson and is hence unnecessary. + np.testing.assert_equal( + output0, expected_output0, "OUTPUT0 doesn't match expected OUTPUT0" + ) + np.testing.assert_equal( + output1, expected_output1, "OUTPUT1 doesn't match expected OUTPUT1" + ) + + def test_growth_error(self): + # 2 MiBs + total_byte_size = 2 * 1024 * 1024 + shape = [total_byte_size] + model_name = "identity_uint8_nobatch" + dtype = np.uint8 + with self._shm_leak_detector.Probe() as shm_probe: + self._infer_help(model_name, shape, dtype) + + # 1 GiB payload leads to error in the main Python backend process. + # Total shared memory available is 1GiB. + total_byte_size = 1024 * 1024 * 1024 + shape = [total_byte_size] + with self.assertRaises(InferenceServerException) as ex: + self._infer_help(model_name, shape, dtype) + self.assertIn( + "Failed to increase the shared memory pool size", str(ex.exception) + ) + + # 512 MiBs payload leads to error in the Python stub process. + total_byte_size = 512 * 1024 * 1024 + shape = [total_byte_size] + with self.assertRaises(InferenceServerException) as ex: + self._infer_help(model_name, shape, dtype) + self.assertIn( + "Failed to increase the shared memory pool size", str(ex.exception) + ) + + # 2 MiBs + # Send a small payload to make sure it is still working properly + total_byte_size = 2 * 1024 * 1024 + shape = [total_byte_size] + with self._shm_leak_detector.Probe() as shm_probe: + self._infer_help(model_name, shape, dtype) + + # GPU tensors are not supported on jetson + # CUDA Shared memory is not supported on jetson if not TEST_JETSON: - def test_growth_error(self): - # 2 MiBs - total_byte_size = 2 * 1024 * 1024 - shape = [total_byte_size] - model_name = 'identity_uint8_nobatch' - dtype = np.uint8 - with self._shm_leak_detector.Probe() as shm_probe: - self._infer_help(model_name, shape, dtype) - - # 1 GiB payload leads to error in the main Python backned process. - # Total shared memory available is 1GiB.
- total_byte_size = 1024 * 1024 * 1024 - shape = [total_byte_size] - with self.assertRaises(InferenceServerException) as ex: - self._infer_help(model_name, shape, dtype) - self.assertIn("Failed to increase the shared memory pool size", - str(ex.exception)) - - # 512 MiBs payload leads to error in the Python stub process. - total_byte_size = 512 * 1024 * 1024 - shape = [total_byte_size] - with self.assertRaises(InferenceServerException) as ex: - self._infer_help(model_name, shape, dtype) - self.assertIn("Failed to increase the shared memory pool size", - str(ex.exception)) - - # 2 MiBs - # Send a small paylaod to make sure it is still working properly - total_byte_size = 2 * 1024 * 1024 - shape = [total_byte_size] - with self._shm_leak_detector.Probe() as shm_probe: - self._infer_help(model_name, shape, dtype) + def test_gpu_tensor_error(self): + import tritonclient.utils.cuda_shared_memory as cuda_shared_memory + + model_name = "identity_bool" + with httpclient.InferenceServerClient("localhost:8000") as client: + input_data = np.array([[True] * 1000], dtype=bool) + inputs = [ + httpclient.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + inputs[0].set_data_from_numpy(input_data) + + requested_outputs = [httpclient.InferRequestedOutput("OUTPUT0")] + + # intentionally create a shared memory region with not enough size. + client.unregister_cuda_shared_memory() + shm0_handle = self._create_cuda_region(client, 1, "output0_data") + + requested_outputs[0].set_shared_memory("output0_data", 1) + with self.assertRaises(InferenceServerException) as ex: + client.infer(model_name, inputs, outputs=requested_outputs) + self.assertIn( + "should be at least 1000 bytes to hold the results", + str(ex.exception), + ) + client.unregister_cuda_shared_memory() + cuda_shared_memory.destroy_shared_memory_region(shm0_handle) + + def test_dlpack_tensor_error(self): + import tritonclient.utils.cuda_shared_memory as cuda_shared_memory + + model_name = "dlpack_identity" + with httpclient.InferenceServerClient("localhost:8000") as client: + input_data = np.array([[1] * 1000], dtype=np.float32) + inputs = [ + httpclient.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) + ] + + requested_outputs = [httpclient.InferRequestedOutput("OUTPUT0")] + input_data_size = input_data.itemsize * input_data.size + client.unregister_cuda_shared_memory() + input_region = self._create_cuda_region( + client, input_data_size, "input0_data" + ) + inputs[0].set_shared_memory("input0_data", input_data_size) + cuda_shared_memory.set_shared_memory_region(input_region, [input_data]) + + # Intentionally create a small region to trigger an error + shm0_handle = self._create_cuda_region(client, 1, "output0_data") + requested_outputs[0].set_shared_memory("output0_data", 1) + + with self.assertRaises(InferenceServerException) as ex: + client.infer(model_name, inputs, outputs=requested_outputs) + self.assertIn( + "should be at least 4000 bytes to hold the results", + str(ex.exception), + ) + client.unregister_cuda_shared_memory() + cuda_shared_memory.destroy_shared_memory_region(shm0_handle) def test_async_infer(self): model_name = "identity_uint8" @@ -158,18 +231,19 @@ def test_async_infer(self): with self._shm_leak_detector.Probe() as shm_probe: with httpclient.InferenceServerClient( - "localhost:8000", - concurrency=request_parallelism) as client: + "localhost:8000", concurrency=request_parallelism + ) as client: input_datas = [] requests = [] for i in 
range(request_parallelism): - input_data = (16384 * np.random.randn(*shape)).astype( - np.uint8) + input_data = (16384 * np.random.randn(*shape)).astype(np.uint8) input_datas.append(input_data) inputs = [ httpclient.InferInput( - "INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + "INPUT0", + input_data.shape, + np_to_triton_dtype(input_data.dtype), + ) ] inputs[0].set_data_from_numpy(input_data) requests.append(client.async_infer(model_name, inputs)) @@ -180,76 +254,92 @@ def test_async_infer(self): results = requests[i].get_result() output_data = results.as_numpy("OUTPUT0") - self.assertIsNotNone(output_data, - "error: expected 'OUTPUT0'") + self.assertIsNotNone(output_data, "error: expected 'OUTPUT0'") self.assertTrue( np.array_equal(output_data, input_datas[i]), "error: expected output {} to match input {}".format( - output_data, input_datas[i])) + output_data, input_datas[i] + ), + ) # Make sure the requests ran in parallel. stats = client.get_inference_statistics(model_name) - test_cond = (len(stats['model_stats']) != 1) or ( - stats['model_stats'][0]['name'] != model_name) + test_cond = (len(stats["model_stats"]) != 1) or ( + stats["model_stats"][0]["name"] != model_name + ) self.assertFalse( - test_cond, - "error: expected statistics for {}".format(model_name)) - - stat = stats['model_stats'][0] - self.assertFalse((stat['inference_count'] != 8) or ( - stat['execution_count'] != 1 - ), "error: expected execution_count == 1 and inference_count == 8, got {} and {}" - .format(stat['execution_count'], - stat['inference_count'])) - batch_stat = stat['batch_stats'][0] + test_cond, "error: expected statistics for {}".format(model_name) + ) + + stat = stats["model_stats"][0] self.assertFalse( - batch_stat['batch_size'] != 8, - f"error: expected batch_size == 8, got {batch_stat['batch_size']}" + (stat["inference_count"] != 8) or (stat["execution_count"] != 1), + "error: expected execution_count == 1 and inference_count == 8, got {} and {}".format( + stat["execution_count"], stat["inference_count"] + ), + ) + batch_stat = stat["batch_stats"][0] + self.assertFalse( + batch_stat["batch_size"] != 8, + f"error: expected batch_size == 8, got {batch_stat['batch_size']}", ) # Check metrics to make sure they are reported correctly - metrics = httpreq.get('http://localhost:8002/metrics') + metrics = httpreq.get("http://localhost:8002/metrics") print(metrics.text) - success_str = 'nv_inference_request_success{model="identity_uint8",version="1"}' - infer_count_str = 'nv_inference_count{model="identity_uint8",version="1"}' - infer_exec_str = 'nv_inference_exec_count{model="identity_uint8",version="1"}' + success_str = ( + 'nv_inference_request_success{model="identity_uint8",version="1"}' + ) + infer_count_str = ( + 'nv_inference_count{model="identity_uint8",version="1"}' + ) + infer_exec_str = ( + 'nv_inference_exec_count{model="identity_uint8",version="1"}' + ) success_val = None infer_count_val = None infer_exec_val = None for line in metrics.text.splitlines(): if line.startswith(success_str): - success_val = float(line[len(success_str):]) + success_val = float(line[len(success_str) :]) if line.startswith(infer_count_str): - infer_count_val = float(line[len(infer_count_str):]) + infer_count_val = float(line[len(infer_count_str) :]) if line.startswith(infer_exec_str): - infer_exec_val = float(line[len(infer_exec_str):]) + infer_exec_val = float(line[len(infer_exec_str) :]) self.assertFalse( success_val != 4, "error: expected metric {} == 4, got {}".format( - success_str, success_val)) 
+ success_str, success_val + ), + ) self.assertFalse( infer_count_val != 8, "error: expected metric {} == 8, got {}".format( - infer_count_str, infer_count_val)) + infer_count_str, infer_count_val + ), + ) self.assertFalse( infer_exec_val != 1, "error: expected metric {} == 1, got {}".format( - infer_exec_str, infer_exec_val)) + infer_exec_str, infer_exec_val + ), + ) def test_bool(self): - model_name = 'identity_bool' + model_name = "identity_bool" with self._shm_leak_detector.Probe() as shm_probe: with httpclient.InferenceServerClient("localhost:8000") as client: input_data = np.array([[True, False, True]], dtype=bool) inputs = [ - httpclient.InferInput("INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0) self.assertTrue(np.all(output0 == input_data)) @@ -260,21 +350,32 @@ def test_infer_pytorch(self): with httpclient.InferenceServerClient("localhost:8000") as client: input_data = np.zeros(shape, dtype=np.float32) inputs = [ - httpclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - output_data = result.as_numpy('OUT') + output_data = result.as_numpy("OUT") self.assertIsNotNone(output_data, "error: expected 'OUT'") - # expected inference resposne from a zero tensor + # expected inference response from a zero tensor expected_result = [ - -2.2377274, -2.3976364, -2.2464046, -2.2790744, -2.3828976, - -2.2940576, -2.2928185, -2.340665, -2.275219, -2.292135 + -2.2377274, + -2.3976364, + -2.2464046, + -2.2790744, + -2.3828976, + -2.2940576, + -2.2928185, + -2.340665, + -2.275219, + -2.292135, ] - self.assertTrue(np.allclose(output_data[0], expected_result), - 'Inference result is not correct') + self.assertTrue( + np.allclose(output_data[0], expected_result), + "Inference result is not correct", + ) def test_init_args(self): model_name = "init_args" @@ -283,35 +384,39 @@ def test_init_args(self): with httpclient.InferenceServerClient("localhost:8000") as client: input_data = np.zeros(shape, dtype=np.float32) inputs = [ - httpclient.InferInput("IN", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + httpclient.InferInput( + "IN", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - # output respone in this model is the number of keys in the args + # output response in this model is the number of keys in the args self.assertTrue( result.as_numpy("OUT") == 7, - "Number of keys in the init args is not correct") + "Number of keys in the init args is not correct", + ) def test_unicode(self): model_name = "string" shape = [1] - for i in range(3): + # The first run will use np.bytes_ and the second run will use + # np.object_ + for i in range(2): with self._shm_leak_detector.Probe() as shm_probe: - with httpclient.InferenceServerClient( - "localhost:8000") as client: - utf8 = '😀' - input_data = np.array([bytes(utf8, encoding='utf-8')], - dtype=np.bytes_) + with httpclient.InferenceServerClient("localhost:8000") as client: + utf8 = "😀" + input_data = np.array( + [bytes(utf8, encoding="utf-8")], 
dtype=np.bytes_ + ) inputs = [ httpclient.InferInput( - "INPUT0", shape, - np_to_triton_dtype(input_data.dtype)) + "INPUT0", shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0) self.assertEqual(output0[0], input_data) @@ -321,36 +426,36 @@ def test_optional_input(self): with self._shm_leak_detector.Probe() as shm_probe: for has_input0 in [True, False]: for has_input1 in [True, False]: - self._optional_input_infer(model_name, has_input0, - has_input1) + self._optional_input_infer(model_name, has_input0, has_input1) def test_string(self): model_name = "string_fixed" shape = [1] - for i in range(6): + # Test different string outputs. This test will send 4 requests to the + # backend. The model will return 4 responses (np.object_ and np.bytes) * + # (empty output and fixed output) + for i in range(4): with self._shm_leak_detector.Probe() as shm_probe: - with httpclient.InferenceServerClient( - "localhost:8000") as client: - input_data = np.array(['123456'], dtype=np.object_) + with httpclient.InferenceServerClient("localhost:8000") as client: + input_data = np.array(["123456"], dtype=np.object_) inputs = [ httpclient.InferInput( - "INPUT0", shape, - np_to_triton_dtype(input_data.dtype)) + "INPUT0", shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertIsNotNone(output0) if i % 2 == 0: - self.assertEqual(output0[0], - input_data.astype(np.bytes_)) + self.assertEqual(output0[0], input_data.astype(np.bytes_)) else: self.assertEqual(output0.size, 0) def test_non_contiguous(self): - model_name = 'non_contiguous' + model_name = "non_contiguous" shape = [2, 10, 11, 6, 5] new_shape = [10, 2, 6, 5, 11] shape_reorder = [1, 0, 4, 2, 3] @@ -358,8 +463,9 @@ def test_non_contiguous(self): input_numpy = np.random.rand(*shape) input_numpy = input_numpy.astype(np.float32) inputs = [ - httpclient.InferInput("INPUT0", shape, - np_to_triton_dtype(input_numpy.dtype)) + httpclient.InferInput( + "INPUT0", shape, np_to_triton_dtype(input_numpy.dtype) + ) ] inputs[0].set_data_from_numpy(input_numpy) result = client.infer(model_name, inputs) @@ -369,10 +475,10 @@ def test_non_contiguous(self): output1 = input_numpy.T output2 = np.transpose(input_numpy, shape_reorder) - self.assertTrue(np.all(output0 == result.as_numpy('OUTPUT0'))) - self.assertTrue(np.all(output1 == result.as_numpy('OUTPUT1'))) - self.assertTrue(np.all(output2 == result.as_numpy('OUTPUT2'))) + self.assertTrue(np.all(output0 == result.as_numpy("OUTPUT0"))) + self.assertTrue(np.all(output1 == result.as_numpy("OUTPUT1"))) + self.assertTrue(np.all(output2 == result.as_numpy("OUTPUT2"))) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/python_unittest.py b/qa/L0_backend_python/python_unittest.py old mode 100644 new mode 100755 index c29e2d80dd..c956412f9d --- a/qa/L0_backend_python/python_unittest.py +++ b/qa/L0_backend_python/python_unittest.py @@ -1,4 +1,6 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,46 +30,59 @@ sys.path.append("../../common") -import test_util as tu -import shm_util +import os import unittest + +import shm_util +import test_util as tu import tritonclient.grpc as grpcclient from tritonclient.utils import * -import os class PythonUnittest(tu.TestResultCollector): - def setUp(self): self._shm_leak_detector = shm_util.ShmLeakDetector() def _run_unittest(self, model_name): with grpcclient.InferenceServerClient("localhost:8001") as client: # No input is required - result = client.infer(model_name, [], client_timeout=120) - output0 = result.as_numpy('OUTPUT0') + result = client.infer(model_name, [], client_timeout=240) + output0 = result.as_numpy("OUTPUT0") - # The model returns 1 if the tests were sucessfully passed. + # The model returns 1 if the tests were successfully passed. # Otherwise, it will return 0. self.assertEqual(output0, [1]) def test_python_unittest(self): - model_name = os.environ['MODEL_NAME'] - - if model_name == 'bls' or model_name == 'bls_memory' or model_name == 'bls_memory_async': - # For these tests, the memory region size will be grown. Because of - # this we need to use the shared memory probe only on the later - # call so that the probe can detect the leak correctly. - self._run_unittest(model_name) + model_name = os.environ["MODEL_NAME"] + bls_kind = os.environ.get("BLS_KIND", "non_decoupled") - # [FIXME] See DLIS-3684 + if bls_kind == "decoupled": + # Skip the shared memory probe for decoupled models for now as + # there are some small changes in the shared memory usage when + # running decoupled inferences. Confirmed that the memory growth + # is bounded. self._run_unittest(model_name) - with self._shm_leak_detector.Probe() as shm_probe: - self._run_unittest(model_name) else: - with self._shm_leak_detector.Probe() as shm_probe: + if ( + model_name == "bls" + or model_name == "bls_memory" + or model_name == "bls_memory_async" + or model_name == "bls_request_rescheduling" + ): + # For these tests, the memory region size will be grown. Because of + # this we need to use the shared memory probe only on the later + # call so that the probe can detect the leak correctly. + self._run_unittest(model_name) + + # [FIXME] See DLIS-3684 self._run_unittest(model_name) + with self._shm_leak_detector.Probe() as shm_probe: + self._run_unittest(model_name) + else: + with self._shm_leak_detector.Probe() as shm_probe: + self._run_unittest(model_name) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py b/qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py new file mode 100755 index 0000000000..06b5cd7fad --- /dev/null +++ b/qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py @@ -0,0 +1,111 @@ +#!/usr/bin/env python +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../../common") + +# GRPC streaming helpers.. +import queue +import unittest +from functools import partial + +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException + + +class UserData: + def __init__(self): + self._completed_requests = queue.Queue() + + +def callback(user_data, result, error): + if error: + user_data._completed_requests.put(error) + else: + user_data._completed_requests.put(result) + + +class GrpcEndpointTest(tu.TestResultCollector): + def test_grpc_decoupled(self, sequence_id=0, sequence_start=False): + user_data = UserData() + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: + # Reload the model to reset the flag + triton_client.unload_model("iterative_sequence") + triton_client.load_model("iterative_sequence") + + triton_client.start_stream(callback=partial(callback, user_data)) + inputs = [] + inputs.append(grpcclient.InferInput("IN", [1], "INT32")) + inputs[0].set_data_from_numpy(np.array([3], dtype=np.int32)) + + triton_client.async_stream_infer( + model_name="iterative_sequence", + inputs=inputs, + sequence_id=sequence_id, + sequence_start=sequence_start, + ) + res_count = 3 + while res_count > 0: + data_item = user_data._completed_requests.get() + res_count -= 1 + if type(data_item) == InferenceServerException: + raise data_item + else: + self.assertEqual(res_count, data_item.as_numpy("OUT")[0]) + self.assertEqual(0, res_count) + + def test_grpc_non_decoupled(self, sequence_id=0, sequence_start=False): + with grpcclient.InferenceServerClient("localhost:8001") as triton_client: + # Reload the model to reset the flag + triton_client.unload_model("request_rescheduling_addsub") + triton_client.load_model("request_rescheduling_addsub") + + inputs = [] + inputs.append(grpcclient.InferInput("INPUT0", [16], "FP32")) + inputs.append(grpcclient.InferInput("INPUT1", [16], "FP32")) + input0_val = np.random.randn(*[16]).astype(np.float32) + input1_val = np.random.randn(*[16]).astype(np.float32) + inputs[0].set_data_from_numpy(input0_val) + inputs[1].set_data_from_numpy(input1_val) + + results = triton_client.infer( + model_name="request_rescheduling_addsub", + inputs=inputs, + ) + + output0_data = results.as_numpy("OUTPUT0") + output1_data = results.as_numpy("OUTPUT1") + + self.assertTrue(np.array_equal(output0_data, input0_val + input1_val)) + self.assertTrue(np.array_equal(output1_data, input0_val - input1_val)) + + +if __name__ == "__main__": + unittest.main() diff --git 
a/qa/L0_backend_python/request_rescheduling/test.sh b/qa/L0_backend_python/request_rescheduling/test.sh new file mode 100755 index 0000000000..8dc43dc83f --- /dev/null +++ b/qa/L0_backend_python/request_rescheduling/test.sh @@ -0,0 +1,116 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +CLIENT_PY=../python_unittest.py +CLIENT_LOG="./request_rescheduling_client.log" +EXPECTED_NUM_TESTS="1" +TEST_RESULT_FILE='test_results.txt' +source ../../common/util.sh + +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends + +RET=0 + +rm -fr *.log ./models *.txt + +mkdir -p models/bls_request_rescheduling/1/ +cp ../../python_models/bls_request_rescheduling/model.py models/bls_request_rescheduling/1/ +cp ../../python_models/bls_request_rescheduling/config.pbtxt models/bls_request_rescheduling + +mkdir -p models/request_rescheduling_addsub/1/ +cp ../../python_models/request_rescheduling_addsub/model.py models/request_rescheduling_addsub/1/ +cp ../../python_models/request_rescheduling_addsub/config.pbtxt models/request_rescheduling_addsub + +mkdir -p models/iterative_sequence/1/ +cp ../../python_models/iterative_sequence/model.py models/iterative_sequence/1/ +cp ../../python_models/iterative_sequence/config.pbtxt models/iterative_sequence + +mkdir -p models/wrong_return_type/1/ +cp ../../python_models/wrong_return_type/model.py models/wrong_return_type/1/ +cp ../../python_models/wrong_return_type/config.pbtxt models/wrong_return_type + +SERVER_LOG="./request_rescheduling_server.log" +SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --model-control-mode=explicit --load-model=* --log-verbose=1" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +export MODEL_NAME='bls_request_rescheduling' + +set +e +python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** bls_request_rescheduling test FAILED. 
\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +GRPC_TEST_PY=./grpc_endpoint_test.py +EXPECTED_NUM_TESTS="2" + +set +e +python3 $GRPC_TEST_PY >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** GRPC Endpoint test FAILED. \n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + + +if [ $RET -eq 1 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Request Rescheduling test FAILED. \n***" +else + echo -e "\n***\n*** Request Rescheduling test PASSED. \n***" +fi + +exit $RET diff --git a/qa/L0_backend_python/restart/models/restart/1/model.py b/qa/L0_backend_python/restart/models/restart/1/model.py index 72bce2933a..1f7491498e 100644 --- a/qa/L0_backend_python/restart/models/restart/1/model.py +++ b/qa/L0_backend_python/restart/models/restart/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,29 +24,30 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils -import c_python_backend_utils as c_utils from os import path +import c_python_backend_utils as c_utils +import triton_python_backend_utils as pb_utils + class TritonPythonModel: - def execute(self, requests): # This function will be called once to record the free memory. Then, # the stub process will be killed to trigger Python backend restart. # After that this value will be read again to make sure that it matches # before restart. - file_name = 'free_memory.txt' + file_name = "free_memory.txt" current_free_memory = str(c_utils.shared_memory.free_memory()) if path.exists(file_name): - with open(file_name, 'r') as f: + with open(file_name, "r") as f: expected_free_memory = f.read() - assert expected_free_memory == current_free_memory, \ - (f'Free shared memory before and after restart are not equal. ' - '{expected_free_memory} (before) != {current_free_memory} (after).') + assert expected_free_memory == current_free_memory, ( + f"Free shared memory before and after restart are not equal. " + "{expected_free_memory} (before) != {current_free_memory} (after)." + ) else: - with open(file_name, 'w') as f: + with open(file_name, "w") as f: f.write(current_free_memory) responses = [] diff --git a/qa/L0_backend_python/restart/restart_test.py b/qa/L0_backend_python/restart/restart_test.py old mode 100644 new mode 100755 index 534642c2e1..4f4bf63082 --- a/qa/L0_backend_python/restart/restart_test.py +++ b/qa/L0_backend_python/restart/restart_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,32 +27,34 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../../common") +import unittest + +import numpy as np import test_util as tu import tritonclient.http as httpclient from tritonclient.utils import * -import numpy as np -import unittest class RestartTest(tu.TestResultCollector): - def _infer_helper(self, model_name, shape, data_type): with httpclient.InferenceServerClient("localhost:8000") as client: input_data_0 = np.array(np.random.randn(*shape), dtype=data_type) inputs = [ - httpclient.InferInput("INPUT0", shape, - np_to_triton_dtype(input_data_0.dtype)) + httpclient.InferInput( + "INPUT0", shape, np_to_triton_dtype(input_data_0.dtype) + ) ] inputs[0].set_data_from_numpy(input_data_0) result = client.infer(model_name, inputs) - output0 = result.as_numpy('OUTPUT0') + output0 = result.as_numpy("OUTPUT0") self.assertTrue(np.all(input_data_0 == output0)) def test_restart(self): shape = [1, 16] - model_name = 'restart' + model_name = "restart" dtype = np.float32 # Since the stub process has been killed, the first request @@ -64,10 +68,10 @@ def test_restart(self): def test_infer(self): shape = [1, 16] - model_name = 'restart' + model_name = "restart" dtype = np.float32 self._infer_helper(model_name, shape, dtype) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_backend_python/restart/test.sh b/qa/L0_backend_python/restart/test.sh old mode 100644 new mode 100755 index 64c80332ac..f016af54c3 --- a/qa/L0_backend_python/restart/test.sh +++ b/qa/L0_backend_python/restart/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,13 +25,13 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -CLIENT_LOG="./client.log" +CLIENT_LOG="./restart_client.log" EXPECTED_NUM_TESTS="7" TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" -SERVER_LOG="./inference_server.log" +SERVER_LOG="./restart_server.log" source ../../common/util.sh source ../common.sh @@ -127,4 +127,3 @@ else fi exit $RET - diff --git a/qa/L0_backend_python/setup_python_enviroment.sh b/qa/L0_backend_python/setup_python_enviroment.sh new file mode 100755 index 0000000000..90d0f6eaf2 --- /dev/null +++ b/qa/L0_backend_python/setup_python_enviroment.sh @@ -0,0 +1,171 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
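Both restart test cases drive the server through the same identity-inference helper shown above; as a self-contained snippet (model name, shape, and I/O names taken from the test), it amounts to:

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype


def infer_identity(model_name="restart", shape=(1, 16), dtype=np.float32):
    # Send random data to the identity model and check it comes back unchanged.
    with httpclient.InferenceServerClient("localhost:8000") as client:
        data = np.array(np.random.randn(*shape), dtype=dtype)
        inputs = [
            httpclient.InferInput("INPUT0", list(shape), np_to_triton_dtype(data.dtype))
        ]
        inputs[0].set_data_from_numpy(data)
        result = client.infer(model_name, inputs)
        assert np.all(result.as_numpy("OUTPUT0") == data)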
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +RET=0 +set -e +if [ ${PYTHON_ENV_VERSION} = "10" ]; then + echo No need to set up anything for default python3.${PYTHON_ENV_VERSION} + exit $RET +fi + +source common.sh +source ../common/util.sh + +SERVER=/opt/tritonserver/bin/tritonserver +BASE_SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 --disable-auto-complete-config" +PYTHON_BACKEND_BRANCH=$PYTHON_BACKEND_REPO_TAG +SERVER_ARGS=$BASE_SERVER_ARGS +SERVER_LOG="./inference_server.log" +export PYTHON_ENV_VERSION=${PYTHON_ENV_VERSION:="10"} +RET=0 +EXPECTED_VERSION_STRINGS="" + +rm -fr ./models +rm -rf *.tar.gz +install_build_deps +install_conda + +# Test other python versions +conda update -n base -c defaults conda -y +# Create a model with python 3.8 version +# Successful execution of the Python model indicates that the environment has +# been setup correctly. +if [ ${PYTHON_ENV_VERSION} = "8" ]; then + create_conda_env "3.8" "python-3-8" + conda install -c conda-forge libstdcxx-ng=12 -y + conda install numpy=1.23.4 -y + conda install tensorflow=2.10.0 -y + EXPECTED_VERSION_STRING="Python version is 3.8, NumPy version is 1.23.4, and Tensorflow version is 2.10.0" + create_python_backend_stub + conda-pack -o python3.8.tar.gz + path_to_conda_pack="$PWD/python-3-8" + mkdir -p $path_to_conda_pack + tar -xzf python3.8.tar.gz -C $path_to_conda_pack + mkdir -p models/python_3_8/1/ + cp ../python_models/python_version/config.pbtxt ./models/python_3_8 + (cd models/python_3_8 && \ + sed -i "s/^name:.*/name: \"python_3_8\"/" config.pbtxt && \ + echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}">> config.pbtxt) + cp ../python_models/python_version/model.py ./models/python_3_8/1/ + cp python_backend/builddir/triton_python_backend_stub ./models/python_3_8 +fi + +# Create a model with python 3.9 version +# Successful execution of the Python model indicates that the environment has +# been setup correctly. 
+if [ ${PYTHON_ENV_VERSION} = "9" ]; then + create_conda_env "3.9" "python-3-9" + conda install -c conda-forge libstdcxx-ng=12 -y + conda install numpy=1.23.4 -y + conda install tensorflow=2.10.0 -y + EXPECTED_VERSION_STRING="Python version is 3.9, NumPy version is 1.23.4, and Tensorflow version is 2.10.0" + create_python_backend_stub + conda-pack -o python3.9.tar.gz + path_to_conda_pack="$PWD/python-3-9" + mkdir -p $path_to_conda_pack + tar -xzf python3.9.tar.gz -C $path_to_conda_pack + mkdir -p models/python_3_9/1/ + cp ../python_models/python_version/config.pbtxt ./models/python_3_9 + (cd models/python_3_9 && \ + sed -i "s/^name:.*/name: \"python_3_9\"/" config.pbtxt && \ + echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}">> config.pbtxt) + cp ../python_models/python_version/model.py ./models/python_3_9/1/ + cp python_backend/builddir/triton_python_backend_stub ./models/python_3_9 +fi + +# Create a model with python 3.11 version +# Successful execution of the Python model indicates that the environment has +# been setup correctly. +if [ ${PYTHON_ENV_VERSION} = "11" ]; then + create_conda_env "3.11" "python-3-11" + # tensorflow needs to be installed before numpy so pip does not mess up conda + # environment + pip install tensorflow==2.12.0 + conda install -c conda-forge libstdcxx-ng=12 -y + conda install numpy=1.23.5 -y + EXPECTED_VERSION_STRING="Python version is 3.11, NumPy version is 1.23.5, and Tensorflow version is 2.12.0" + create_python_backend_stub + conda-pack -o python3.11.tar.gz + path_to_conda_pack="$PWD/python-3-11" + mkdir -p $path_to_conda_pack + tar -xzf python3.11.tar.gz -C $path_to_conda_pack + mkdir -p models/python_3_11/1/ + cp ../python_models/python_version/config.pbtxt ./models/python_3_11 + (cd models/python_3_11 && \ + sed -i "s/^name:.*/name: \"python_3_11\"/" config.pbtxt && \ + echo "parameters: {key: \"EXECUTION_ENV_PATH\", value: {string_value: \"$path_to_conda_pack\"}}">> config.pbtxt) + cp ../python_models/python_version/model.py ./models/python_3_11/1/ + cp python_backend/builddir/triton_python_backend_stub ./models/python_3_11 +fi +conda deactivate +rm -rf ./miniconda + +# test that +set +e +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +grep "$EXPECTED_VERSION_STRING" $SERVER_LOG +if [ $? -ne 0 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** $EXPECTED_VERSION_STRING was not found in Triton logs. 
\n***" + RET=1 +fi +set -e + +echo "python environment 3.${PYTHON_ENV_VERSION}" +# copy the stub out to /opt/tritonserver/backends/python/triton_python_backend_stub +cp python_backend/builddir/triton_python_backend_stub /opt/tritonserver/backends/python/triton_python_backend_stub +# Set up environment and stub for each test +add-apt-repository ppa:deadsnakes/ppa -y +apt-get update && apt-get -y install \ + "python3.${PYTHON_ENV_VERSION}-dev" \ + "python3.${PYTHON_ENV_VERSION}-distutils" \ + libboost-dev +rm -f /usr/bin/python3 && \ +ln -s "/usr/bin/python3.${PYTHON_ENV_VERSION}" /usr/bin/python3 +pip3 install --upgrade install requests numpy virtualenv protobuf +find /opt/tritonserver/qa/pkgs/ -maxdepth 1 -type f -name \ + "tritonclient-*linux*.whl" | xargs printf -- '%s[all]' | \ + xargs pip3 install --upgrade + +# Build triton-shm-monitor for the test +cd python_backend && rm -rf install build && mkdir build && cd build && \ + cmake -DCMAKE_INSTALL_PREFIX:PATH=$PWD/install \ + -DTRITON_COMMON_REPO_TAG:STRING=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG:STRING=${TRITON_CORE_REPO_TAG} \ + -DTRITON_BACKEND_REPO_TAG:STRING=${TRITON_BACKEND_REPO_TAG} .. && \ + make -j16 triton-shm-monitor install +cp $PWD/install/backends/python/triton_shm_monitor.cpython-* /opt/tritonserver/qa/common/. +set +e +exit $RET diff --git a/qa/L0_backend_python/test.sh b/qa/L0_backend_python/test.sh index a4e11dfc9e..449cee8480 100755 --- a/qa/L0_backend_python/test.sh +++ b/qa/L0_backend_python/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -44,6 +44,7 @@ SERVER=${TRITON_DIR}/bin/tritonserver export BACKEND_DIR=${TRITON_DIR}/backends export TEST_JETSON=${TEST_JETSON:=0} export CUDA_VISIBLE_DEVICES=0 +export PYTHON_ENV_VERSION=${PYTHON_ENV_VERSION:="10"} BASE_SERVER_ARGS="--model-repository=`pwd`/models --backend-directory=${BACKEND_DIR} --log-verbose=1" # Set the default byte size to 5MBs to avoid going out of shared memory. The @@ -53,7 +54,7 @@ SERVER_ARGS="$BASE_SERVER_ARGS --backend-config=python,shm-default-byte-size=524 PYTHON_BACKEND_BRANCH=$PYTHON_BACKEND_REPO_TAG CLIENT_PY=./python_test.py CLIENT_LOG="./client.log" -EXPECTED_NUM_TESTS="9" +EXPECTED_NUM_TESTS="11" TEST_RESULT_FILE='test_results.txt' SERVER_LOG="./inference_server.log" source ../common/util.sh @@ -61,6 +62,20 @@ source ./common.sh rm -fr *.log ./models +python3 --version | grep "3.10" > /dev/null +if [ $? -ne 0 ]; then + echo -e "Expecting Python default version to be: Python 3.10 but actual version is $(python3 --version)" + exit 1 +fi + +(bash -ex setup_python_enviroment.sh) + +python3 --version | grep "3.${PYTHON_ENV_VERSION}" > /dev/null +if [ $? -ne 0 ]; then + echo -e "Expecting Python version to be: Python 3.${PYTHON_ENV_VERSION} but actual version is $(python3 --version)" + exit 1 +fi + mkdir -p models/identity_fp32/1/ cp ../python_models/identity_fp32/model.py ./models/identity_fp32/1/model.py cp ../python_models/identity_fp32/config.pbtxt ./models/identity_fp32/config.pbtxt @@ -128,19 +143,23 @@ mkdir -p models/string_fixed/1/ cp ../python_models/string_fixed/model.py ./models/string_fixed/1/ cp ../python_models/string_fixed/config.pbtxt ./models/string_fixed -# Skip torch install on Jetson since it is already installed. 
+mkdir -p models/dlpack_identity/1/ +cp ../python_models/dlpack_identity/model.py ./models/dlpack_identity/1/ +cp ../python_models/dlpack_identity/config.pbtxt ./models/dlpack_identity + if [ "$TEST_JETSON" == "0" ]; then - pip3 install torch==1.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html + pip3 install torch==1.13.0+cpu -f https://download.pytorch.org/whl/torch_stable.html else - # test_growth_error is skipped on jetson - EXPECTED_NUM_TESTS=8 + pip3 install torch==1.13.0 -f https://download.pytorch.org/whl/torch_stable.html + # GPU tensor tests are disabled on jetson + EXPECTED_NUM_TESTS=9 fi prev_num_pages=`get_shm_pages` run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -176,8 +195,8 @@ prev_num_pages=`get_shm_pages` # Triton non-graceful exit run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -216,8 +235,8 @@ if [ "$TEST_JETSON" == "0" ]; then prev_num_pages=`get_shm_pages` run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -252,8 +271,8 @@ cp ../python_models/identity_fp32/config.pbtxt ./models/multi_file/ prev_num_pages=`get_shm_pages` run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -286,9 +305,9 @@ export MY_ENV="MY_ENV" prev_num_pages=`get_shm_pages` run_server if [ "$SERVER_PID" == "0" ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed to start $SERVER\n***" echo -e "\n***\n*** Environment variable test failed \n***" - cat $SERVER_LOG exit 1 fi @@ -315,8 +334,8 @@ SERVER_ARGS="$BASE_SERVER_ARGS --backend-config=python,shm-default-byte-size=$sh run_server if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" cat $SERVER_LOG + echo -e "\n***\n*** Failed to start $SERVER\n***" exit 1 fi @@ -336,77 +355,95 @@ done kill $SERVER_PID wait $SERVER_PID -# Disable env test for Jetson since build is non-dockerized and cloud storage repos are not supported -# Disable ensemble, unittest, io and bls tests for Jetson since GPU Tensors are not supported -# Disable variants test for Jetson since already built without GPU Tensor support -# Disable decoupled test because it uses GPU tensors -if [ "$TEST_JETSON" == "0" ]; then - (cd env && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi - - (cd ensemble && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi - - (cd unittest && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi - - (cd io && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi - - (cd bls && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi +# Test model getting killed during initialization +rm -fr ./models +mkdir -p models/init_exit/1/ +cp ../python_models/init_exit/model.py ./models/init_exit/1/model.py +cp ../python_models/init_exit/config.pbtxt ./models/init_exit/config.pbtxt - (cd decoupled && bash -ex test.sh) - if [ $? -ne 0 ]; then - RET=1 - fi +ERROR_MESSAGE="Stub process 'init_exit_0_0' is not healthy." - (cd variants && bash -ex test.sh) - if [ $? 
-ne 0 ]; then +prev_num_pages=`get_shm_pages` +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "*** FAILED: unexpected success starting $SERVER" >> $CLIENT_LOG RET=1 - fi + kill $SERVER_PID + wait $SERVER_PID +else + if grep "$ERROR_MESSAGE" $SERVER_LOG; then + echo -e "Found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + else + echo $CLIENT_LOG + echo -e "Not found \"$ERROR_MESSAGE\"" >> $CLIENT_LOG + RET=1 + fi fi -(cd lifecycle && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 +current_num_pages=`get_shm_pages` +if [ $current_num_pages -ne $prev_num_pages ]; then + cat $SERVER_LOG + ls /dev/shm + echo -e "\n***\n*** Test Failed. Shared memory pages where not cleaned properly. +Shared memory pages before starting triton equals to $prev_num_pages +and shared memory pages after starting triton equals to $current_num_pages \n***" + exit 1 fi -(cd restart && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 +# Disable env test for Jetson since cloud storage repos are not supported +# Disable ensemble, io and bls tests for Jetson since GPU Tensors are not supported +# Disable variants test for Jetson since already built without GPU Tensor support +# Disable decoupled test because it uses GPU tensors +if [ "$TEST_JETSON" == "0" ]; then + SUBTESTS="ensemble io bls decoupled variants python_based_backends" + for TEST in ${SUBTESTS}; do + # Run each subtest in a separate virtual environment to avoid conflicts + # between dependencies. + virtualenv --system-site-packages venv + source venv/bin/activate + + (cd ${TEST} && bash -ex test.sh) + if [ $? -ne 0 ]; then + echo "Subtest ${TEST} FAILED" + RET=1 + fi + + deactivate + rm -fr venv + done + + if [ ${PYTHON_ENV_VERSION} = "10" ]; then + # In 'env' test we use miniconda for dependency management. No need to run + # the test in a virtual environment. + (cd env && bash -ex test.sh) + if [ $? -ne 0 ]; then + echo "Subtest env FAILED" + RET=1 + fi + fi fi -(cd model_control && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 -fi +SUBTESTS="lifecycle restart model_control examples argument_validation logging custom_metrics request_rescheduling" +for TEST in ${SUBTESTS}; do + # Run each subtest in a separate virtual environment to avoid conflicts + # between dependencies. + virtualenv --system-site-packages venv + source venv/bin/activate -(cd examples && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 -fi + (cd ${TEST} && bash -ex test.sh) -(cd argument_validation && bash -ex test.sh) -if [ $? -ne 0 ]; then - RET=1 -fi + if [ $? -ne 0 ]; then + echo "Subtest ${TEST} FAILED" + RET=1 + fi + deactivate + rm -fr venv +done if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else - cat $SERVER_LOG echo -e "\n***\n*** Test FAILED\n***" fi diff --git a/qa/L0_backend_python/variants/test.sh b/qa/L0_backend_python/variants/test.sh old mode 100644 new mode 100755 index 24ceb1cf4c..65116cb2dc --- a/qa/L0_backend_python/variants/test.sh +++ b/qa/L0_backend_python/variants/test.sh @@ -25,7 +25,7 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
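Each subtest above now runs inside its own --system-site-packages virtualenv so that per-test pip installs cannot leak between subtests. A rough Python rendering of that shell loop, for illustration only (the patch itself uses virtualenv plus source venv/bin/activate; stdlib venv stands in here):

import os
import shutil
import subprocess
import venv
from pathlib import Path


def run_subtest_in_venv(subtest_dir):
    # Build a throwaway environment that still sees system site-packages,
    # run the subtest's test.sh with it first on PATH, then clean up.
    env_dir = Path("venv").resolve()
    venv.EnvBuilder(system_site_packages=True, with_pip=True).create(env_dir)
    child_env = dict(os.environ, PATH=f"{env_dir}/bin:{os.environ['PATH']}")
    try:
        proc = subprocess.run(
            ["bash", "-ex", "test.sh"], cwd=subtest_dir, env=child_env
        )
        return proc.returncode
    finally:
        shutil.rmtree(env_dir, ignore_errors=True)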
-# Buidling a CPU build of Python backend +# Building a CPU build of Python backend source ../common.sh install_build_deps diff --git a/qa/L0_backend_tutorial/test.sh b/qa/L0_backend_tutorial/test.sh index c745ea7ed2..4706c2c2dd 100755 --- a/qa/L0_backend_tutorial/test.sh +++ b/qa/L0_backend_tutorial/test.sh @@ -40,13 +40,14 @@ source ../common/util.sh RET=0 # Client build requires recent version of CMake (FetchContent required) -wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ - apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ - apt-get update && \ - apt-get install -y --no-install-recommends \ - cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1 \ +# Using CMAKE installation instruction from:: https://apt.kitware.com/ +apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . /etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* \ rapidjson-dev cmake --version @@ -186,8 +187,16 @@ if [ $? -ne 0 ]; then RET=1 fi +FOUND_MATCH=0 grep "batched INPUT value: \[ 1.000000, 1.100000, 1.200000, 1.300000, 2.000000, 2.100000, 2.200000, 2.300000, 3.000000, 3.100000, 3.200000, 3.300000, 4.000000, 4.100000, 4.200000, 4.300000, 10.000000, 10.100000, 10.200000, 10.300000, 20.000000, 20.100000, 20.200001, 20.299999, 30.000000, 30.100000, 30.200001, 30.299999, 40.000000, 40.099998, 40.200001, 40.299999 \]" $SERVER_LOG if [ $? -ne 0 ]; then + FOUND_MATCH=1 +fi +grep "batched INPUT value: \[ 10.000000, 10.100000, 10.200000, 10.300000, 20.000000, 20.100000, 20.200001, 20.299999, 30.000000, 30.100000, 30.200001, 30.299999, 40.000000, 40.099998, 40.200001, 40.299999, 1.000000, 1.100000, 1.200000, 1.300000, 2.000000, 2.100000, 2.200000, 2.300000, 3.000000, 3.100000, 3.200000, 3.300000, 4.000000, 4.100000, 4.200000, 4.300000 \]" $SERVER_LOG +if [ $? -ne 0 ]; then + FOUND_MATCH=1 +fi +if [ $FOUND_MATCH -eq 0 ]; then echo -e "\n***\n*** Failed to verify recommended server log. \n***" cat $SERVER_LOG cat $RECOMMENDED_LOG diff --git a/qa/L0_batch_custom/batch_custom_test.py b/qa/L0_batch_custom/batch_custom_test.py new file mode 100755 index 0000000000..6cd6346ad3 --- /dev/null +++ b/qa/L0_batch_custom/batch_custom_test.py @@ -0,0 +1,273 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
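The paired greps above accept either interleaving of the two batched requests, since the recommended backend may assemble the batch in either order. Expressed directly in Python, the intent is "pass if any one acceptable log line is present":

def recommended_batch_logged(server_log_path, expected_lines):
    # Return True if the server log contains at least one of the acceptable
    # "batched INPUT value: [...]" lines.
    with open(server_log_path) as f:
        log = f.read()
    return any(line in log for line in expected_lines)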
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import os +import threading +import time +import unittest +from builtins import range +from collections.abc import Iterable + +import infer_util as iu +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient + +# By default, find tritonserver on "localhost", but can be overridden +# with TRITONSERVER_IPADDR envvar +_tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost") + +_deferred_exceptions_lock = threading.Lock() +_deferred_exceptions = [] + + +class BatcherTest(tu.TestResultCollector): + def setUp(self): + # The helper client for setup will be GRPC for simplicity. + self.triton_client_ = grpcclient.InferenceServerClient( + f"{_tritonserver_ipaddr}:8001" + ) + self.precreated_shm_regions_ = [] + global _deferred_exceptions + _deferred_exceptions = [] + + def tearDown(self): + super().tearDown() + + def add_deferred_exception(self, ex): + global _deferred_exceptions + with _deferred_exceptions_lock: + _deferred_exceptions.append(ex) + + def check_deferred_exception(self): + # Just raise one of the exceptions... + with _deferred_exceptions_lock: + if len(_deferred_exceptions) > 0: + raise _deferred_exceptions[0] + + def check_response( + self, + trial, + bs, + thresholds, + requested_outputs=("OUTPUT0", "OUTPUT1"), + input_size=16, + shm_region_names=None, + precreated_shm_regions=None, + ): + try: + start_ms = int(round(time.time() * 1000)) + + if ( + trial == "savedmodel" + or trial == "graphdef" + or trial == "libtorch" + or trial == "onnx" + or trial == "plan" + or trial == "python" + ): + tensor_shape = (bs, input_size) + iu.infer_exact( + self, + trial, + tensor_shape, + bs, + np.float32, + np.float32, + np.float32, + swap=False, + model_version=1, + outputs=requested_outputs, + use_http=False, + use_grpc=False, + use_http_json_tensors=False, + skip_request_id_check=True, + use_streaming=False, + ) + else: + self.assertFalse(True, "unknown trial type: " + trial) + + end_ms = int(round(time.time() * 1000)) + + lt_ms = thresholds[0] + gt_ms = thresholds[1] + if lt_ms is not None: + self.assertTrue( + (end_ms - start_ms) < lt_ms, + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) + if gt_ms is not None: + self.assertTrue( + (end_ms - start_ms) > gt_ms, + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) + except Exception as ex: + self.add_deferred_exception(ex) + + def check_status(self, model_name, batch_exec, request_cnt, infer_cnt, exec_count): + # There is a time window between when responses are returned and statistics are updated. 
+ # To prevent intermittent test failure during that window, wait up to 10 seconds for the + # inference statistics to be ready. + num_tries = 10 + for i in range(num_tries): + stats = self.triton_client_.get_inference_statistics(model_name, "1") + self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") + actual_exec_cnt = stats.model_stats[0].execution_count + if actual_exec_cnt == exec_count: + break + print( + "WARNING: expect {} executions, got {} (attempt {})".format( + exec_count, actual_exec_cnt, i + ) + ) + time.sleep(1) + + self.assertEqual( + stats.model_stats[0].name, + model_name, + "expect model stats for model {}".format(model_name), + ) + self.assertEqual( + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(model_name), + ) + + if batch_exec: + batch_stats = stats.model_stats[0].batch_stats + self.assertEqual( + len(batch_stats), + len(batch_exec), + "expected {} different batch-sizes, got {}".format( + len(batch_exec), len(batch_stats) + ), + ) + + for batch_stat in batch_stats: + bs = batch_stat.batch_size + bc = batch_stat.compute_infer.count + self.assertTrue(bs in batch_exec, "unexpected batch-size {}".format(bs)) + # Get count from one of the stats + self.assertEqual( + bc, + batch_exec[bs], + "expected model-execution-count {} for batch size {}, got {}".format( + batch_exec[bs], bs, bc + ), + ) + + actual_request_cnt = stats.model_stats[0].inference_stats.success.count + self.assertEqual( + actual_request_cnt, + request_cnt, + "expected model-request-count {}, got {}".format( + request_cnt, actual_request_cnt + ), + ) + + actual_exec_cnt = stats.model_stats[0].execution_count + if isinstance(exec_count, Iterable): + self.assertIn( + actual_exec_cnt, + exec_count, + "expected model-exec-count {}, got {}".format( + exec_count, actual_exec_cnt + ), + ) + else: + self.assertEqual( + actual_exec_cnt, + exec_count, + "expected model-exec-count {}, got {}".format( + exec_count, actual_exec_cnt + ), + ) + actual_infer_cnt = stats.model_stats[0].inference_count + self.assertEqual( + actual_infer_cnt, + infer_cnt, + "expected model-inference-count {}, got {}".format( + infer_cnt, actual_infer_cnt + ), + ) + + def test_volume_batching(self): + # Send 12 requests with batch size 1. The max_queue_delay is set + # to non-zero. Depending upon the timing of the requests arrival + # there can be either 4-6 model executions. + model_base = "onnx" + dtype = np.float16 + shapes = ( + [ + 1, + 4, + 4, + ], + ) + + try: + # use threads to send 12 requests without waiting for response + threads = [] + for i in range(12): + threads.append( + threading.Thread( + target=iu.infer_zero, + args=(self, model_base, 1, dtype, shapes, shapes), + kwargs={ + "use_http": True, + "use_grpc": False, + "use_http_json_tensors": False, + "use_streaming": False, + }, + ) + ) + for t in threads: + t.start() + for t in threads: + t.join() + self.check_deferred_exception() + model_name = tu.get_zero_model_name(model_base, len(shapes), dtype) + self.check_status(model_name, None, 12, 12, (4, 5, 6)) + except Exception as ex: + self.assertTrue(False, "unexpected error {}".format(ex)) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_batch_custom/test.sh b/qa/L0_batch_custom/test.sh new file mode 100755 index 0000000000..01701df661 --- /dev/null +++ b/qa/L0_batch_custom/test.sh @@ -0,0 +1,192 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
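check_status above tolerates the window between responses being returned and statistics being updated by polling get_inference_statistics for up to 10 seconds. Reduced to its core, the polling loop looks like this (server address and version string taken from the test):

import time

import tritonclient.grpc as grpcclient


def wait_for_execution_count(model_name, expected, timeout_s=10):
    # Poll model statistics until the reported execution_count reaches the
    # expected value, mirroring BatcherTest.check_status above.
    client = grpcclient.InferenceServerClient("localhost:8001")
    for _ in range(timeout_s):
        stats = client.get_inference_statistics(model_name, "1")
        if stats.model_stats[0].execution_count == expected:
            return stats
        time.sleep(1)
    raise TimeoutError(f"{model_name}: execution_count never reached {expected}")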
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +## This test tests the ability to use custom batching strategies with models. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +export CUDA_VISIBLE_DEVICES=0 + +BATCH_CUSTOM_TEST=batch_custom_test.py +CLIENT_LOG_BASE="./client.log" +DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_identity_model_repository +EXPECTED_NUM_TESTS="1" +MODEL_NAME="onnx_zero_1_float16" +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=models --log-verbose 1" +SERVER_LOG_BASE="./inference_server.log" +TEST_RESULT_FILE='test_results.txt' +TRITON_BACKEND_REPO_TAG=${TRITON_BACKEND_REPO_TAG:="main"} +TRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG:="main"} + +source ../common/util.sh +RET=0 + +# Batch strategy build requires recent version of CMake (FetchContent required) +# Using CMAKE installation instruction from:: https://apt.kitware.com/ +apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . 
/etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* rapidjson-dev +cmake --version + +# Set up repository +rm -fr *.log* ./backend +rm -fr models && mkdir models +cp -r $DATADIR/$MODEL_NAME models + +CONFIG_PATH="models/${MODEL_NAME}/config.pbtxt" +echo "dynamic_batching { max_queue_delay_microseconds: 10000}" >> ${CONFIG_PATH} +echo "instance_group [ { kind: KIND_GPU count: 2 }]" >> ${CONFIG_PATH} +echo "parameters { key: \"MAX_BATCH_VOLUME_BYTES\" value: {string_value: \"96\"}}" >> ${CONFIG_PATH} + +# Create custom batching libraries +git clone --single-branch --depth=1 -b $TRITON_BACKEND_REPO_TAG \ + https://github.com/triton-inference-server/backend.git + +(cd backend/examples/batching_strategies/volume_batching && + mkdir build && + cd build && + cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \ + -DTRITON_CORE_REPO_TAG=$TRITON_CORE_REPO_TAG .. && + make -j4 install) + + (cd backend/examples/batching_strategies/single_batching && + mkdir build && + cd build && + cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \ + -DTRITON_CORE_REPO_TAG=$TRITON_CORE_REPO_TAG .. && + make -j4 install) + +cp -r backend/examples/batching_strategies/volume_batching/build/libtriton_volumebatching.so models +cp -r backend/examples/batching_strategies/single_batching/build/libtriton_singlebatching.so models + +# Run a test to validate the single batching strategy example. +# Then, run tests to validate the volume batching example being passed in via the backend dir, model dir, version dir, and model config. +BACKEND_DIR="/opt/tritonserver/backends/onnxruntime" +MODEL_DIR="models/$MODEL_NAME" +VERSION_DIR="$MODEL_DIR/1/" + +test_types=('single_batching_backend' 'backend_directory' 'model_directory' 'version_directory' 'model_config') +test_setups=("cp models/libtriton_singlebatching.so ${BACKEND_DIR}/batchstrategy.so && sed -i \"s/(4, 5, 6))/(12))/\" ${BATCH_CUSTOM_TEST}" + "cp models/libtriton_volumebatching.so ${BACKEND_DIR}/batchstrategy.so && sed -i \"s/(12))/(4, 5, 6))/\" ${BATCH_CUSTOM_TEST}" + "mv ${BACKEND_DIR}/batchstrategy.so ${MODEL_DIR} && cp models/libtriton_singlebatching.so ${BACKEND_DIR}" + "mv ${MODEL_DIR}/batchstrategy.so ${VERSION_DIR}/batchstrategy.so" + "mv ${VERSION_DIR}/batchstrategy.so models/${MODEL_NAME}/libtriton_volumebatching.so && echo \"parameters: {key: \\\"TRITON_BATCH_STRATEGY_PATH\\\", value: {string_value: \\\"${MODEL_DIR}/libtriton_volumebatching.so\\\"}}\" >> ${CONFIG_PATH}") + +for i in "${!test_setups[@]}"; do + echo "Running ${test_types[$i]} test" + eval ${test_setups[$i]} + + SERVER_LOG=${SERVER_LOG_BASE}_${test_types[$i]} + CLIENT_LOG=${CLIENT_LOG_BASE}_${test_types[$i]} + + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + if [ `grep -c "Loading custom batching strategy" $SERVER_LOG` != "1" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Failed to load custom batching strategy.***" + RET=1 + else + set +e + python $BATCH_CUSTOM_TEST >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** ${test_types[$i]} Test Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? 
-ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** ${test_types[$i]} Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + fi + + kill $SERVER_PID + wait $SERVER_PID +done + +# Test ModelBatchInitialize failure +FILE_PATH="backend/examples/batching_strategies/volume_batching/src/volume_batching.cc" +OLD_STRING="\/\/ Batcher will point to an unsigned integer representing the maximum" +NEW_STRING="return TRITONSERVER_ErrorNew(TRITONSERVER_ERROR_NOT_FOUND,\"Failure test case\");" + +sed -i "s/${OLD_STRING}/${NEW_STRING}/g" ${FILE_PATH} + +(cd backend/examples/batching_strategies/volume_batching && + cd build && + cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install \ + -DTRITON_CORE_REPO_TAG=$TRITON_CORE_REPO_TAG .. && + make -j4 install) + +cp -r backend/examples/batching_strategies/volume_batching/build/libtriton_volumebatching.so models/${MODEL_NAME}/libtriton_volumebatching.so + +SERVER_LOG=${SERVER_LOG_BASE}_batching_init_failure + +run_server +if [ "$SERVER_PID" != "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** ModelBatchInit Error Test: unexpected successful server start $SERVER\n***" + kill_server + RET=1 +else + if [ `grep -c "Failure test case" $SERVER_LOG` -lt 1 ] || [ `grep -c "Not found" $SERVER_LOG` -lt 1 ]; then + cat $SERVER_LOG + echo -e "\n***\n*** ModelBatchInit Error Test: failed to find \"Failure test case\" message and/or \"Not found\" error type" + RET=1 + fi +fi + + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_batch_input/batch_input_test.py b/qa/L0_batch_input/batch_input_test.py old mode 100644 new mode 100755 index 2931dadbad..02de27d921 --- a/qa/L0_batch_input/batch_input_test.py +++ b/qa/L0_batch_input/batch_input_test.py @@ -1,4 +1,6 @@ -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,52 +27,68 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
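The 4-6 executions expected by test_volume_batching follow from the byte budget configured above, assuming the volume batcher counts only the raw input payload: each request carries an FP16 tensor of shape [1, 4, 4], i.e. 16 elements * 2 bytes = 32 bytes, so MAX_BATCH_VOLUME_BYTES=96 admits about three requests per batch, and 12 single-request inferences collapse into roughly 12 / 3 = 4 executions, with 5 or 6 possible when arrival timing splits a batch:

import numpy as np

bytes_per_request = int(np.prod([1, 4, 4])) * np.dtype(np.float16).itemsize  # 32
requests_per_batch = 96 // bytes_per_request                                 # 3
min_executions = 12 // requests_per_batch                                    # 4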
import sys + sys.path.append("../common") +import queue import unittest +from functools import partial + import numpy as np -import infer_util as iu import test_util as tu -import tritonhttpclient -from tritonclientutils import InferenceServerException +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException class BatchInputTest(tu.TestResultCollector): - def setUp(self): + self.client = grpcclient.InferenceServerClient(url="localhost:8001") + + def callback(user_data, result, error): + if error: + user_data.put(error) + else: + user_data.put(result) + + self.client_callback = callback + + def set_inputs(self, shapes, input_name): self.dtype_ = np.float32 self.inputs = [] - # 4 set of inputs with shape [2], [4], [1], [3] - for value in [2, 4, 1, 3]: - self.inputs.append([ - tritonhttpclient.InferInput('RAGGED_INPUT', [1, value], "FP32") - ]) + for shape in shapes: + self.inputs.append( + [grpcclient.InferInput(input_name, [1, shape[0]], "FP32")] + ) self.inputs[-1][0].set_data_from_numpy( - np.full([1, value], value, np.float32)) - self.client = tritonhttpclient.InferenceServerClient( - url="localhost:8000", concurrency=len(self.inputs)) + np.full([1, shape[0]], shape[0], np.float32) + ) + + def set_inputs_for_batch_item(self, shapes, input_name): + self.dtype_ = np.float32 + self.inputs = [] + for shape in shapes: + self.inputs.append([grpcclient.InferInput(input_name, shape, "FP32")]) + self.inputs[-1][0].set_data_from_numpy(np.full(shape, shape[0], np.float32)) def test_ragged_output(self): model_name = "ragged_io" + # The model is an identity model + self.set_inputs([[2], [4], [1], [3]], "INPUT0") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - # The model is identity model - self.inputs = [] - for value in [2, 4, 1, 3]: - self.inputs.append( - [tritonhttpclient.InferInput('INPUT0', [1, value], "FP32")]) - self.inputs[-1][0].set_data_from_numpy( - np.full([1, value], value, np.float32)) - output_name = 'OUTPUT0' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "OUTPUT0" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value_list = [[v] * v for v in [2, 4, 1, 3]] expected_value_list = [ @@ -80,31 +98,37 @@ def test_ragged_output(self): for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. 
output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value_list[idx]), "Expect response {} to have value {}, got {}".format( - idx, expected_value_list[idx], output_data)) + idx, expected_value_list[idx], output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_ragged_input(self): model_name = "ragged_acc_shape" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'RAGGED_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] - + output_name = "RAGGED_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) value_lists = [[v] * v for v in [2, 4, 1, 3]] expected_value = [] @@ -114,191 +138,218 @@ def test_ragged_input(self): for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() - + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_element_count(self): model_name = "ragged_element_count_acc_zero" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_AND_SIZE_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_AND_SIZE_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value = np.asarray([[2, 4, 1, 3]], np.float32) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. 
output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_accumulated_element_count(self): model_name = "ragged_acc_shape" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_AND_SIZE_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_AND_SIZE_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value = np.asarray([[2, 6, 7, 10]], np.float32) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_accumulated_element_count_with_zero(self): model_name = "ragged_element_count_acc_zero" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value = np.asarray([[0, 2, 6, 7, 10]], np.float32) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. 
output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_max_element_count_as_shape(self): model_name = "ragged_acc_shape" + self.set_inputs([[2], [4], [1], [3]], "RAGGED_INPUT") + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertEqual( - output_data.shape, (1, 4), - "Expect response {} to have shape to represent max element count {} among the batch , got {}" - .format(idx, 4, output_data.shape)) + output_data.shape, + (1, 4), + "Expect response {} to have shape to represent max element count {} among the batch , got {}".format( + idx, 4, output_data.shape + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_batch_item_shape_flatten(self): # Use 4 set of inputs with shape # [1, 4, 1], [1, 1, 2], [1, 1, 2], [1, 2, 2] # Note that the test only checks the formation of "BATCH_INPUT" where # the value of "RAGGED_INPUT" is irrelevant, only the shape matters - self.inputs = [] - for value in [[1, 4, 1], [1, 1, 2], [1, 1, 2], [1, 2, 2]]: - self.inputs.append( - [tritonhttpclient.InferInput('RAGGED_INPUT', value, "FP32")]) - self.inputs[-1][0].set_data_from_numpy( - np.full(value, value[0], np.float32)) - self.client = tritonhttpclient.InferenceServerClient( - url="localhost:8000", concurrency=len(self.inputs)) + self.set_inputs_for_batch_item( + [[1, 4, 1], [1, 1, 2], [1, 1, 2], [1, 2, 2]], "RAGGED_INPUT" + ) model_name = "batch_item_flatten" + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for inputs in self.inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - self.client.async_infer(model_name=model_name, - inputs=inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) expected_value = np.asarray([[4, 1, 1, 2, 1, 2, 2, 2]], np.float32) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. 
- result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertTrue( np.array_equal(output_data, expected_value), "Expect response {} to have value {}, got {}".format( - idx, expected_value, output_data)) + idx, expected_value, output_data + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() def test_batch_item_shape(self): # Use 3 set of inputs with shape [2, 1, 2], [1, 1, 2], [1, 2, 2] # Note that the test only checks the formation of "BATCH_INPUT" where # the value of "RAGGED_INPUT" is irrelevant, only the shape matters - inputs = [] - for value in [[2, 1, 2], [1, 1, 2], [1, 2, 2]]: - inputs.append( - [tritonhttpclient.InferInput('RAGGED_INPUT', value, "FP32")]) - inputs[-1][0].set_data_from_numpy( - np.full(value, value[0], np.float32)) - client = tritonhttpclient.InferenceServerClient(url="localhost:8000", - concurrency=len(inputs)) + self.set_inputs_for_batch_item( + [[2, 1, 2], [1, 1, 2], [1, 2, 2]], "RAGGED_INPUT" + ) expected_outputs = [ np.array([[1.0, 2.0], [1.0, 2.0]]), @@ -307,34 +358,41 @@ def test_batch_item_shape(self): ] model_name = "batch_item" + user_data = queue.Queue() + self.client.start_stream(callback=partial(self.client_callback, user_data)) - output_name = 'BATCH_OUTPUT' - outputs = [tritonhttpclient.InferRequestedOutput(output_name)] + output_name = "BATCH_OUTPUT" + outputs = [grpcclient.InferRequestedOutput(output_name)] async_requests = [] try: - for request_inputs in inputs: + for input in self.inputs: # Asynchronous inference call. async_requests.append( - client.async_infer(model_name=model_name, - inputs=request_inputs, - outputs=outputs)) + self.client.async_stream_infer( + model_name=model_name, inputs=input, outputs=outputs + ) + ) for idx in range(len(async_requests)): # Get the result from the initiated asynchronous inference request. # Note the call will block till the server responds. - result = async_requests[idx].get_result() + result = user_data.get() # Validate the results by comparing with precomputed values. output_data = result.as_numpy(output_name) self.assertTrue( np.allclose(output_data, expected_outputs[idx]), - "Expect response to have value:\n{}, got:\n{}\nEqual matrix:\n{}" - .format(expected_outputs[idx], output_data, - np.isclose(expected_outputs[idx], output_data))) + "Expect response to have value:\n{}, got:\n{}\nEqual matrix:\n{}".format( + expected_outputs[idx], + output_data, + np.isclose(expected_outputs[idx], output_data), + ), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) + self.client.stop_stream() -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_batch_input/test.sh b/qa/L0_batch_input/test.sh old mode 100644 new mode 100755 index 56ca448f3a..e780516ec4 --- a/qa/L0_batch_input/test.sh +++ b/qa/L0_batch_input/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
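The earlier grpc_endpoint_test.py and the reworked batch_input_test.py above share the same streaming-client idiom: register a callback that pushes each result or error onto a queue, open one gRPC stream, issue async_stream_infer calls, and drain the queue. Condensed from batch_input_test above (the decoupled grpc_endpoint_test drains more than one response per request, but the callback/queue wiring is identical):

import queue
from functools import partial

import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException


def stream_infer(model_name, inputs_list, output_name):
    # Send each request over one gRPC stream and collect responses in order
    # of arrival.
    results = queue.Queue()

    def callback(q, result, error):
        q.put(error if error else result)

    with grpcclient.InferenceServerClient("localhost:8001") as client:
        client.start_stream(callback=partial(callback, results))
        for inputs in inputs_list:
            client.async_stream_infer(model_name=model_name, inputs=inputs)
        outputs = []
        for _ in inputs_list:
            item = results.get()
            if isinstance(item, InferenceServerException):
                raise item
            outputs.append(item.as_numpy(output_name))
        client.stop_stream()
    return outputs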
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -54,7 +54,7 @@ SERVER_LOG="./inference_server.log" source ../common/util.sh # If BACKENDS not specified, set to all -BACKENDS=${BACKENDS:="onnx savedmodel plan"} +BACKENDS=${BACKENDS:="onnx savedmodel plan libtorch"} rm -f $SERVER_LOG $CLIENT_LOG @@ -82,6 +82,8 @@ for BACKEND in $BACKENDS; do # batch input is generated properly cp -r $IDENTITY_DATADIR/${BACKEND}_nobatch_zero_1_float32 models/ragged_io (cd models/ragged_io && \ + # In case of libtorch, update I/O names + sed -i "s/__0/0/" config.pbtxt && \ sed -i "s/${BACKEND}_nobatch_zero_1_float32/ragged_io/" config.pbtxt && \ sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ sed -i "s/name: \"INPUT0\"/name: \"INPUT0\"\\nallow_ragged_batch: true/" config.pbtxt && \ @@ -99,7 +101,7 @@ for BACKEND in $BACKENDS; do fi set +e - python $BATCH_INPUT_TEST >$CLIENT_LOG 2>&1 + python3 $BATCH_INPUT_TEST >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" diff --git a/qa/L0_batcher/batcher_test.py b/qa/L0_batcher/batcher_test.py old mode 100644 new mode 100755 index 31382c5918..38e208c21e --- a/qa/L0_batcher/batcher_test.py +++ b/qa/L0_batcher/batcher_test.py @@ -1,4 +1,6 @@ -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,27 +27,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -from builtins import range import os -import time import threading +import time import unittest -import numpy as np +from builtins import range + import infer_util as iu +import numpy as np import test_util as tu - import tritonclient.grpc as grpcclient # By default, find tritonserver on "localhost", but can be overridden # with TRITONSERVER_IPADDR envvar -_tritonserver_ipaddr = os.environ.get('TRITONSERVER_IPADDR', 'localhost') +_tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost") -TEST_SYSTEM_SHARED_MEMORY = bool( - int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0))) -TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY', - 0))) +TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0))) +TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0))) if TEST_SYSTEM_SHARED_MEMORY: import tritonclient.utils.shared_memory as shm @@ -54,13 +55,13 @@ # Test with either GRPC of HTTP, but not both since when we check # results we expect only one to run -USE_GRPC = (os.environ.get('USE_GRPC', 1) != "0") -USE_HTTP = (os.environ.get('USE_HTTP', 1) != "0") +USE_GRPC = os.environ.get("USE_GRPC", 1) != "0" +USE_HTTP = os.environ.get("USE_HTTP", 1) != "0" if USE_GRPC and USE_HTTP: USE_GRPC = False assert USE_GRPC or USE_HTTP, "USE_GRPC or USE_HTTP must be non-zero" -BACKENDS = os.environ.get('BACKENDS', "graphdef savedmodel onnx libtorch plan") +BACKENDS = os.environ.get("BACKENDS", "graphdef savedmodel onnx libtorch plan python") _trials = BACKENDS.split(" ") @@ -69,6 +70,8 @@ _ragged_batch_supported_trials.append("plan") if "onnx" in _trials: _ragged_batch_supported_trials.append("onnx") +if "libtorch" in _trials: + 
_ragged_batch_supported_trials.append("libtorch") _max_queue_delay_ms = 10000 @@ -77,10 +80,11 @@ class BatcherTest(tu.TestResultCollector): - def setUp(self): # The helper client for setup will be GRPC for simplicity. - self.triton_client_ = grpcclient.InferenceServerClient(f"{_tritonserver_ipaddr}:8001") + self.triton_client_ = grpcclient.InferenceServerClient( + f"{_tritonserver_ipaddr}:8001" + ) self.precreated_shm_regions_ = [] global _deferred_exceptions _deferred_exceptions = [] @@ -102,19 +106,22 @@ def create_advance(self, shm_regions=None): if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: precreated_shm_regions = [] if shm_regions is None: - shm_regions = ['output0', 'output1'] + shm_regions = ["output0", "output1"] for shm_region in shm_regions: if TEST_SYSTEM_SHARED_MEMORY: shm_handle = shm.create_shared_memory_region( - shm_region + '_data', '/' + shm_region, 512) + shm_region + "_data", "/" + shm_region, 512 + ) self.triton_client_.register_system_shared_memory( - shm_region + '_data', '/' + shm_region, 512) + shm_region + "_data", "/" + shm_region, 512 + ) else: shm_handle = cudashm.create_shared_memory_region( - shm_region + '_data', 512, 0) + shm_region + "_data", 512, 0 + ) self.triton_client_.register_cuda_shared_memory( - shm_region + '_data', - cudashm.get_raw_handle(shm_handle), 0, 512) + shm_region + "_data", cudashm.get_raw_handle(shm_handle), 0, 512 + ) # Collect precreated handles for cleanup self.precreated_shm_regions_.append(shm_handle) precreated_shm_regions.append(shm_handle) @@ -132,19 +139,27 @@ def check_deferred_exception(self): if len(_deferred_exceptions) > 0: raise _deferred_exceptions[0] - def check_response(self, - trial, - bs, - thresholds, - requested_outputs=("OUTPUT0", "OUTPUT1"), - input_size=16, - shm_region_names=None, - precreated_shm_regions=None): + def check_response( + self, + trial, + bs, + thresholds, + requested_outputs=("OUTPUT0", "OUTPUT1"), + input_size=16, + shm_region_names=None, + precreated_shm_regions=None, + ): try: start_ms = int(round(time.time() * 1000)) - if trial == "savedmodel" or trial == "graphdef" or trial == "libtorch" \ - or trial == "onnx" or trial == "plan": + if ( + trial == "savedmodel" + or trial == "graphdef" + or trial == "libtorch" + or trial == "onnx" + or trial == "plan" + or trial == "python" + ): tensor_shape = (bs, input_size) iu.infer_exact( self, @@ -165,7 +180,8 @@ def check_response(self, shm_region_names=shm_region_names, precreated_shm_regions=precreated_shm_regions, use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY) + use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY, + ) else: self.assertFalse(True, "unknown trial type: " + trial) @@ -176,72 +192,109 @@ def check_response(self, if lt_ms is not None: self.assertTrue( (end_ms - start_ms) < lt_ms, - "expected less than " + str(lt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) if gt_ms is not None: self.assertTrue( (end_ms - start_ms) > gt_ms, - "expected greater than " + str(gt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) except Exception as ex: self.add_deferred_exception(ex) - def check_setup(self, model_name, preferred_batch_sizes, - max_queue_delay_us): + def check_setup(self, model_name, preferred_batch_sizes, 
max_queue_delay_us): # Make sure test.sh set up the correct batcher settings config = self.triton_client_.get_model_config(model_name).config bconfig = config.dynamic_batching - self.assertEqual(len(bconfig.preferred_batch_size), - len(preferred_batch_sizes)) + self.assertEqual(len(bconfig.preferred_batch_size), len(preferred_batch_sizes)) for i in preferred_batch_sizes: self.assertTrue(i in bconfig.preferred_batch_size) - self.assertEqual(bconfig.max_queue_delay_microseconds, - max_queue_delay_us) + self.assertEqual(bconfig.max_queue_delay_microseconds, max_queue_delay_us) def check_status(self, model_name, batch_exec, request_cnt, infer_cnt, exec_count): - stats = self.triton_client_.get_inference_statistics(model_name, "1") - self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") - self.assertEqual(stats.model_stats[0].name, model_name, - "expect model stats for model {}".format(model_name)) + # There is a time window between when responses are returned and statistics are updated. + # To prevent intermittent test failure during that window, wait up to 10 seconds for the + # inference statistics to be ready. + num_tries = 10 + for i in range(num_tries): + stats = self.triton_client_.get_inference_statistics(model_name, "1") + self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") + actual_exec_cnt = stats.model_stats[0].execution_count + if actual_exec_cnt in exec_count: + break + print( + "WARNING: expect {} executions, got {} (attempt {})".format( + exec_count, actual_exec_cnt, i + ) + ) + time.sleep(1) + self.assertEqual( - stats.model_stats[0].version, "1", - "expect model stats for model {} version 1".format(model_name)) + stats.model_stats[0].name, + model_name, + "expect model stats for model {}".format(model_name), + ) + self.assertEqual( + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(model_name), + ) if batch_exec: batch_stats = stats.model_stats[0].batch_stats self.assertEqual( - len(batch_stats), len(batch_exec), + len(batch_stats), + len(batch_exec), "expected {} different batch-sizes, got {}".format( - len(batch_exec), len(batch_stats))) + len(batch_exec), len(batch_stats) + ), + ) for batch_stat in batch_stats: bs = batch_stat.batch_size bc = batch_stat.compute_infer.count - self.assertTrue(bs in batch_exec, - "unexpected batch-size {}".format(bs)) + self.assertTrue(bs in batch_exec, "unexpected batch-size {}".format(bs)) # Get count from one of the stats self.assertEqual( - bc, batch_exec[bs], - "expected model-execution-count {} for batch size {}, got {}". 
- format(batch_exec[bs], bs, bc)) + bc, + batch_exec[bs], + "expected model-execution-count {} for batch size {}, got {}".format( + batch_exec[bs], bs, bc + ), + ) actual_request_cnt = stats.model_stats[0].inference_stats.success.count self.assertEqual( - actual_request_cnt, request_cnt, + actual_request_cnt, + request_cnt, "expected model-request-count {}, got {}".format( - request_cnt, actual_request_cnt)) + request_cnt, actual_request_cnt + ), + ) actual_exec_cnt = stats.model_stats[0].execution_count self.assertIn( - actual_exec_cnt, exec_count, - "expected model-exec-count {}, got {}".format( - request_cnt, actual_exec_cnt)) + actual_exec_cnt, + exec_count, + "expected model-exec-count {}, got {}".format(exec_count, actual_exec_cnt), + ) actual_infer_cnt = stats.model_stats[0].inference_count self.assertEqual( - actual_infer_cnt, infer_cnt, + actual_infer_cnt, + infer_cnt, "expected model-inference-count {}, got {}".format( - infer_cnt, actual_infer_cnt)) + infer_cnt, actual_infer_cnt + ), + ) def test_static_batch_preferred(self): # Send two requests with static batch sizes == preferred @@ -250,20 +303,25 @@ def test_static_batch_preferred(self): precreated_shm_regions = self.create_advance() for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) self.check_response( trial, - 2, (3000, None), - precreated_shm_regions=precreated_shm_regions) + 2, + (3000, None), + precreated_shm_regions=precreated_shm_regions, + ) self.check_response( trial, - 6, (3000, None), - precreated_shm_regions=precreated_shm_regions) + 6, + (3000, None), + precreated_shm_regions=precreated_shm_regions, + ) self.check_deferred_exception() self.check_status(model_name, {2: 1, 6: 1}, 2, 8, (2,)) except Exception as ex: @@ -276,16 +334,19 @@ def test_static_batch_lt_any_preferred(self): precreated_shm_regions = self.create_advance() for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) self.check_response( trial, - 1, (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), - precreated_shm_regions=precreated_shm_regions) + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + precreated_shm_regions=precreated_shm_regions, + ) self.check_deferred_exception() self.check_status(model_name, {1: 1}, 1, 1, (1,)) except Exception as ex: @@ -298,16 +359,19 @@ def test_static_batch_not_preferred(self): precreated_shm_regions = self.create_advance() for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) self.check_response( trial, - 3, (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), - precreated_shm_regions=precreated_shm_regions) + 3, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + precreated_shm_regions=precreated_shm_regions, + ) self.check_deferred_exception() self.check_status(model_name, {3: 1}, 1, 3, (1,)) except Exception as ex: @@ -320,16 
+384,19 @@ def test_static_batch_gt_max_preferred(self): precreated_shm_regions = self.create_advance() for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) self.check_response( trial, - 7, (3000, None), - precreated_shm_regions=precreated_shm_regions) + 7, + (3000, None), + precreated_shm_regions=precreated_shm_regions, + ) self.check_deferred_exception() self.check_status(model_name, {7: 1}, 1, 7, (1,)) except Exception as ex: @@ -350,25 +417,29 @@ def test_multi_batch_different_shape_allow_ragged(self): threads = [] threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, trial, 1, dtype, ([1, 16],), - ([1, 16],)), - kwargs={ - 'use_grpc': USE_GRPC, - 'use_http': USE_HTTP, - 'use_http_json_tensors': False, - 'use_streaming': False - })) - threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, trial, 1, dtype, ([1, 8],), - ([1, 8],)), - kwargs={ - 'use_grpc': USE_GRPC, - 'use_http': USE_HTTP, - 'use_http_json_tensors': False, - 'use_streaming': False - })) + threading.Thread( + target=iu.infer_zero, + args=(self, trial, 1, dtype, ([1, 16],), ([1, 16],)), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + }, + ) + ) + threads.append( + threading.Thread( + target=iu.infer_zero, + args=(self, trial, 1, dtype, ([1, 8],), ([1, 8],)), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -386,17 +457,18 @@ def test_multi_batch_different_shape(self): # immediately and the second delayed by the max batch queue # delay if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -407,20 +479,27 @@ def test_multi_batch_different_shape(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'input_size': 16, - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "input_size": 16, + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 1, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'input_size': 8, - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': 
precreated_shm1_regions - })) + "input_size": 8, + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -438,17 +517,18 @@ def test_multi_batch_not_preferred(self): # delay (minus the difference in time that they arrived in the # queue) if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -457,21 +537,31 @@ def test_multi_batch_not_preferred(self): threads.append( threading.Thread( target=self.check_response, - args=(trial, 1, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 3, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms - 2000)), + args=( + trial, + 3, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms - 2000), + ), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -489,20 +579,21 @@ def test_multi_batch_not_preferred_different_shape(self): # two requests to be immediately responded to and the third # response to be delayed by the max batch queue delay. 
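The check_response calls above encode latency expectations as an (upper bound, optional lower bound) window in milliseconds. A minimal sketch of that pattern in isolation, assuming the QA model naming (graphdef_float32_float32_float32 with INPUT0/INPUT1 of shape [1, 16]) and a server on localhost:8001; the thresholds are illustrative.

import threading
import time

import numpy as np
import tritonclient.grpc as grpcclient

MODEL = "graphdef_float32_float32_float32"  # QA naming convention; illustrative


def timed_infer(results, idx, lt_ms, gt_ms=None):
    # One request per thread; record whether its wall-clock time fell in the window.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    inputs = []
    for name in ("INPUT0", "INPUT1"):
        tensor = grpcclient.InferInput(name, [1, 16], "FP32")
        tensor.set_data_from_numpy(np.zeros((1, 16), np.float32))
        inputs.append(tensor)
    start = time.time()
    client.infer(model_name=MODEL, inputs=inputs)
    elapsed_ms = (time.time() - start) * 1000
    results[idx] = elapsed_ms < lt_ms and (gt_ms is None or elapsed_ms > gt_ms)


results = {}
threads = [
    # Two size-1 requests together reach the preferred batch size 2, so both
    # should return well before the 10 s max queue delay.
    threading.Thread(target=timed_infer, args=(results, 0, 3000)),
    threading.Thread(target=timed_infer, args=(results, 1, 3000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)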
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -513,27 +604,36 @@ def test_multi_batch_not_preferred_different_shape(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 3, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 1, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'input_size': 8, - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "input_size": 8, + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(1) @@ -554,23 +654,24 @@ def test_multi_batch_preferred_different_shape(self): # preferred size so that third and forth response are sent # immediately. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None shm3_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -581,35 +682,43 @@ def test_multi_batch_preferred_different_shape(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 3, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'input_size': 8, - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "input_size": 8, + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 5, (6000, None)), kwargs={ - 'input_size': 8, - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "input_size": 8, + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(1) @@ -629,17 +738,18 @@ def test_multi_batch_gt_max_preferred(self): # be processed by the dynamic batcher. This should cause both # responses to be returned immediately. 
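The shm*_region_names and create_advance bookkeeping used throughout these tests pre-creates output regions and registers them with the server before inference. A minimal sketch of that flow with the system shared-memory utility from tritonclient, assuming 512-byte regions as in create_advance; the region names here are illustrative.

import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm

client = grpcclient.InferenceServerClient(url="localhost:8001")

handles = []
for region in ["op00", "op01"]:
    # Create the 512-byte region locally, then register it with Triton under the same name.
    handles.append(shm.create_shared_memory_region(region + "_data", "/" + region, 512))
    client.register_system_shared_memory(region + "_data", "/" + region, 512)

# ... run inferences whose outputs are written into the registered regions ...

for region, handle in zip(["op00", "op01"], handles):
    client.unregister_system_shared_memory(region + "_data")
    shm.destroy_shared_memory_region(handle)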
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -650,17 +760,21 @@ def test_multi_batch_gt_max_preferred(self): target=self.check_response, args=(trial, 3, (3000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 7, (3000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -681,17 +795,18 @@ def test_multi_batch_sum_gt_max_preferred(self): # since it alone is not greater than max preferred size, will # be delayed. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) @@ -702,18 +817,25 @@ def test_multi_batch_sum_gt_max_preferred(self): target=self.check_response, args=(trial, 3, (3000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 4, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 4, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -729,17 +851,18 @@ def test_multi_same_output0(self): 
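The next tests send requests that ask for only one of the model's two outputs. A minimal sketch of such a request, assuming the QA model naming and a server on localhost:8001; an output that is not listed is simply absent from the response, so as_numpy returns None for it.

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

inputs = []
for name in ("INPUT0", "INPUT1"):
    tensor = grpcclient.InferInput(name, [1, 16], "FP32")
    tensor.set_data_from_numpy(np.arange(16, dtype=np.float32).reshape(1, 16))
    inputs.append(tensor)

# Only OUTPUT0 is listed, so only OUTPUT0 comes back in the response.
result = client.infer(
    model_name="graphdef_float32_float32_float32",
    inputs=inputs,
    outputs=[grpcclient.InferRequestedOutput("OUTPUT0")],
)
print(result.as_numpy("OUTPUT0"))
print(result.as_numpy("OUTPUT1"))  # None: not requested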
# batched and get the correct response even though they don't # request both outputs. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00'] - shm1_region_names = ['ip10', 'ip11', 'op10'] + shm0_region_names = ["ip00", "ip01", "op00"] + shm1_region_names = ["ip10", "ip11", "op10"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00']) - precreated_shm1_regions = self.create_advance(['op10']) + precreated_shm0_regions = self.create_advance(["op00"]) + precreated_shm1_regions = self.create_advance(["op10"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) @@ -751,19 +874,23 @@ def test_multi_same_output0(self): target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'requested_outputs': ("OUTPUT0",), - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "requested_outputs": ("OUTPUT0",), + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'requested_outputs': ("OUTPUT0",), - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "requested_outputs": ("OUTPUT0",), + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -778,17 +905,18 @@ def test_multi_same_output1(self): # batched and get the correct response even though they don't # request both outputs. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op11'] + shm0_region_names = ["ip00", "ip01", "op01"] + shm1_region_names = ["ip10", "ip11", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op01']) - precreated_shm1_regions = self.create_advance(['op11']) + precreated_shm0_regions = self.create_advance(["op01"]) + precreated_shm1_regions = self.create_advance(["op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) @@ -800,19 +928,23 @@ def test_multi_same_output1(self): target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'requested_outputs': ("OUTPUT1",), - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "requested_outputs": ("OUTPUT1",), + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'requested_outputs': ("OUTPUT1",), - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "requested_outputs": ("OUTPUT1",), + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -828,17 +960,18 @@ def test_multi_different_outputs(self): # batched and get the correct response even though they don't # request both outputs. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00'] - shm1_region_names = ['ip10', 'ip11', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00"] + shm1_region_names = ["ip10", "ip11", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00']) - precreated_shm1_regions = self.create_advance(['op11']) + precreated_shm0_regions = self.create_advance(["op00"]) + precreated_shm1_regions = self.create_advance(["op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) @@ -850,19 +983,23 @@ def test_multi_different_outputs(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'requested_outputs': ("OUTPUT0",), - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "requested_outputs": ("OUTPUT0",), + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'requested_outputs': ("OUTPUT1",), - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "requested_outputs": ("OUTPUT1",), + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -877,15 +1014,16 @@ def test_multi_different_output_order(self): # different order. 
They should be batched and get the correct # response even though they use different order. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op11', 'op10'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op11", "op10"] else: shm0_region_names = None shm1_region_names = None for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) @@ -893,21 +1031,25 @@ def test_multi_different_output_order(self): threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(trial, 1, (6000, None)), - kwargs={ - 'requested_outputs': - ("OUTPUT0", "OUTPUT1"), - 'shm_region_names': shm0_region_names - })) - threads.append( - threading.Thread(target=self.check_response, - args=(trial, 1, (6000, None)), - kwargs={ - 'requested_outputs': - ("OUTPUT1", "OUTPUT0"), - 'shm_region_names': shm1_region_names - })) + threading.Thread( + target=self.check_response, + args=(trial, 1, (6000, None)), + kwargs={ + "requested_outputs": ("OUTPUT0", "OUTPUT1"), + "shm_region_names": shm0_region_names, + }, + ) + ) + threads.append( + threading.Thread( + target=self.check_response, + args=(trial, 1, (6000, None)), + kwargs={ + "requested_outputs": ("OUTPUT1", "OUTPUT0"), + "shm_region_names": shm1_region_names, + }, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -927,24 +1069,24 @@ def test_multi_batch_delayed_sum_gt_max_preferred(self): # immediately but the second response, since it alone is not # greater than max preferred size, will be delayed. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] else: shm0_region_names = None shm1_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 2 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) threads = [] threads.append( @@ -952,18 +1094,25 @@ def test_multi_batch_delayed_sum_gt_max_preferred(self): target=self.check_response, args=(trial, 3, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 4, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 4, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -977,7 +1126,7 @@ def test_multi_batch_delayed_sum_gt_max_preferred(self): def test_multi_batch_delayed_use_max_batch(self): # Send three requests with first not having preferred size, # second being smaller than max preferred size but the sum of - # the requests being larger than max preferred size and thrid + # the requests being larger than max preferred size and third # is sent after the first two requests exceeds the queue delay # and the sum of the requests to be in full batch. Use # TRITONSERVER_DELAY_SCHEDULER in the environment so that @@ -986,55 +1135,67 @@ def test_multi_batch_delayed_use_max_batch(self): # while it appears that the first two responses to be returned # after being delayed and the third response to be returned immediately. 
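The delayed-batching tests above only run meaningfully when test.sh starts the server with TRITONSERVER_DELAY_SCHEDULER exported, which makes the scheduler hold requests until the given count is queued. A minimal sketch of the guard they rely on; the exact expected count varies per test (2 to 7 in this file).

import os
import unittest


class DelaySchedulerGuard(unittest.TestCase):
    def test_delay_scheduler_is_configured(self):
        # test.sh exports TRITONSERVER_DELAY_SCHEDULER before launching tritonserver;
        # the scheduler then waits for this many queued requests before batching.
        self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
        self.assertGreaterEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2)


if __name__ == "__main__":
    unittest.main()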
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 3 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) threads = [] threads.append( threading.Thread( target=self.check_response, - args=(trial, 3, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 3, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 4, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 4, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(11) @@ -1057,30 +1218,30 @@ def test_multi_batch_delayed_preferred_different_shape(self): # shape as the third that causes a preferred size so that # third and forth response are sent immediately. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None shm3_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 4 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) threads = [] threads.append( @@ -1088,35 +1249,43 @@ def test_multi_batch_delayed_preferred_different_shape(self): target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 3, (3000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (3000, None)), kwargs={ - 'input_size': 8, - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "input_size": 8, + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 5, (3000, None)), kwargs={ - 'input_size': 8, - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "input_size": 8, + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(1) @@ -1136,12 +1305,12 @@ def test_multi_batch_use_biggest_preferred(self): # that requests can be queued up before scheduler starts # servicing. 
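Each test first calls check_setup to confirm the deployed model carries the expected dynamic-batching settings. A minimal standalone sketch of that verification, assuming the QA model naming and the [2, 6] preferred sizes with the 10-second max queue delay used above.

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# get_model_config returns the protobuf response; .config is the model configuration.
config = client.get_model_config("graphdef_float32_float32_float32").config
dynamic_batching = config.dynamic_batching

assert sorted(dynamic_batching.preferred_batch_size) == [2, 6]
assert dynamic_batching.max_queue_delay_microseconds == 10000 * 1000  # 10 s, as in these tests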
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] - shm4_region_names = ['ip40', 'ip41', 'op40', 'op41'] - shm5_region_names = ['ip50', 'ip51', 'op50', 'op51'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] + shm4_region_names = ["ip40", "ip41", "op40", "op41"] + shm5_region_names = ["ip50", "ip51", "op50", "op51"] else: shm0_region_names = None shm1_region_names = None @@ -1149,23 +1318,23 @@ def test_multi_batch_use_biggest_preferred(self): shm3_region_names = None shm4_region_names = None shm5_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) - precreated_shm4_regions = self.create_advance(['op40', 'op41']) - precreated_shm5_regions = self.create_advance(['op50', 'op51']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) + precreated_shm4_regions = self.create_advance(["op40", "op41"]) + precreated_shm5_regions = self.create_advance(["op50", "op51"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 6 request self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 6) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 6) threads = [] threads.append( @@ -1173,49 +1342,61 @@ def test_multi_batch_use_biggest_preferred(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads.append( 
threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm4_region_names, - 'precreated_shm_regions': precreated_shm4_regions - })) + "shm_region_names": shm4_region_names, + "precreated_shm_regions": precreated_shm4_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm5_region_names, - 'precreated_shm_regions': precreated_shm5_regions - })) + "shm_region_names": shm5_region_names, + "precreated_shm_regions": precreated_shm5_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1234,27 +1415,27 @@ def test_multi_batch_use_best_preferred(self): # that requests can be queued up before scheduler starts # servicing. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [2, 6], _max_queue_delay_ms * 1000) # Need scheduler to wait for queue to contain 3 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) threads = [] threads.append( @@ -1262,26 +1443,35 @@ def test_multi_batch_use_best_preferred(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, - args=(trial, 1, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), + args=( + trial, + 1, + (_max_queue_delay_ms * 1.5, _max_queue_delay_ms), + ), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads[0].start() threads[1].start() time.sleep(1) @@ -1296,41 +1486,36 @@ def test_multi_batch_use_best_preferred(self): def test_multi_batch_preserve_ordering(self): model_base = "custom" dtype = np.float32 - shapes = ([ - 1, - 1, - ],) + shapes = ( + [ + 1, + 1, + ], + ) 
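check_status now tolerates the short window between a response returning and the server's statistics being updated by polling for an acceptable execution count. A standalone sketch of that retry loop, assuming the QA model naming; the set of acceptable counts is illustrative.

import time

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
model_name = "graphdef_float32_float32_float32"  # QA naming convention; illustrative
expected_exec_counts = (2, 3)  # any of these counts is acceptable; illustrative

# Statistics are updated shortly after responses return, so poll briefly instead of
# failing on the first read (the same idea as the retry loop in check_status).
for attempt in range(10):
    stats = client.get_inference_statistics(model_name, "1")
    actual_exec_count = stats.model_stats[0].execution_count
    if actual_exec_count in expected_exec_counts:
        break
    print(
        "WARNING: expected one of {}, got {} (attempt {})".format(
            expected_exec_counts, actual_exec_count, attempt
        )
    )
    time.sleep(1)
else:
    raise AssertionError("execution_count never reached an expected value")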
try: # use threads to send 12 requests without waiting for response threads = [] for i in range(12): if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm_region_name_prefix = [ - "input" + str(i), "output" + str(i) - ] + shm_region_name_prefix = ["input" + str(i), "output" + str(i)] else: shm_region_name_prefix = None threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, model_base, 1, dtype, shapes, - shapes), - kwargs={ - 'use_grpc': - USE_GRPC, - 'use_http': - USE_HTTP, - 'use_http_json_tensors': - False, - 'use_streaming': - False, - 'shm_region_name_prefix': - shm_region_name_prefix, - 'use_system_shared_memory': - TEST_SYSTEM_SHARED_MEMORY, - 'use_cuda_shared_memory': - TEST_CUDA_SHARED_MEMORY - })) + threading.Thread( + target=iu.infer_zero, + args=(self, model_base, 1, dtype, shapes, shapes), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + "shm_region_name_prefix": shm_region_name_prefix, + "use_system_shared_memory": TEST_SYSTEM_SHARED_MEMORY, + "use_cuda_shared_memory": TEST_CUDA_SHARED_MEMORY, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1348,30 +1533,30 @@ def test_preferred_batch_only_aligned(self): # servicing. The batcher should form a batch of preferred # size 4. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None shm3_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [4, 6], 0) # Need scheduler to wait for queue to contain 4 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) threads = [] threads.append( @@ -1379,33 +1564,41 @@ def test_preferred_batch_only_aligned(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + 
"precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1422,33 +1615,33 @@ def test_preferred_batch_only_unaligned(self): # servicing. The batcher should form a batch of preferred # size 4 followed by a batch of size 1. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] - shm4_region_names = ['ip40', 'ip41', 'op40', 'op41'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] + shm4_region_names = ["ip40", "ip41", "op40", "op41"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None shm3_region_names = None shm4_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) - precreated_shm4_regions = self.create_advance(['op40', 'op41']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) + precreated_shm4_regions = self.create_advance(["op40", "op41"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [4, 6], 0) # Need scheduler to wait for queue to contain 3 requests self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 5) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 5) threads = [] threads.append( @@ -1456,41 +1649,51 @@ def test_preferred_batch_only_unaligned(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, 
None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm4_region_names, - 'precreated_shm_regions': precreated_shm4_regions - })) + "shm_region_names": shm4_region_names, + "precreated_shm_regions": precreated_shm4_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1507,13 +1710,13 @@ def test_preferred_batch_only_use_biggest_preferred(self): # servicing. The batcher should form a batch of largest preferred # size 6 followed by a batch of size 1. if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] - shm3_region_names = ['ip30', 'ip31', 'op30', 'op31'] - shm4_region_names = ['ip40', 'ip41', 'op40', 'op41'] - shm5_region_names = ['ip50', 'ip51', 'op50', 'op51'] - shm6_region_names = ['ip60', 'ip61', 'op60', 'op61'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] + shm3_region_names = ["ip30", "ip31", "op30", "op31"] + shm4_region_names = ["ip40", "ip41", "op40", "op41"] + shm5_region_names = ["ip50", "ip51", "op50", "op51"] + shm6_region_names = ["ip60", "ip61", "op60", "op61"] else: shm0_region_names = None shm1_region_names = None @@ -1522,24 +1725,24 @@ def test_preferred_batch_only_use_biggest_preferred(self): shm4_region_names = None shm5_region_names = None shm6_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) - precreated_shm3_regions = self.create_advance(['op30', 'op31']) - precreated_shm4_regions = self.create_advance(['op40', 'op41']) - precreated_shm5_regions = self.create_advance(['op50', 'op51']) - precreated_shm6_regions = self.create_advance(['op60', 'op61']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) + precreated_shm3_regions = self.create_advance(["op30", "op31"]) + precreated_shm4_regions = self.create_advance(["op40", "op41"]) + precreated_shm5_regions = self.create_advance(["op50", "op51"]) + precreated_shm6_regions = self.create_advance(["op60", "op61"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [4, 6], 0) # Need scheduler to wait for queue to contain 6 request self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 7) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 7) threads = [] 
threads.append( @@ -1547,57 +1750,71 @@ def test_preferred_batch_only_use_biggest_preferred(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm3_region_names, - 'precreated_shm_regions': precreated_shm3_regions - })) + "shm_region_names": shm3_region_names, + "precreated_shm_regions": precreated_shm3_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm4_region_names, - 'precreated_shm_regions': precreated_shm4_regions - })) + "shm_region_names": shm4_region_names, + "precreated_shm_regions": precreated_shm4_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm5_region_names, - 'precreated_shm_regions': precreated_shm5_regions - })) + "shm_region_names": shm5_region_names, + "precreated_shm_regions": precreated_shm5_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm6_region_names, - 'precreated_shm_regions': precreated_shm6_regions - })) + "shm_region_names": shm6_region_names, + "precreated_shm_regions": precreated_shm6_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1613,27 +1830,27 @@ def test_preferred_batch_only_use_no_preferred_size(self): # requests can be queued up before scheduler starts # servicing. The batcher should form a batch of of 3. 
if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm0_region_names = ['ip00', 'ip01', 'op00', 'op01'] - shm1_region_names = ['ip10', 'ip11', 'op10', 'op11'] - shm2_region_names = ['ip20', 'ip21', 'op20', 'op21'] + shm0_region_names = ["ip00", "ip01", "op00", "op01"] + shm1_region_names = ["ip10", "ip11", "op10", "op11"] + shm2_region_names = ["ip20", "ip21", "op20", "op21"] else: shm0_region_names = None shm1_region_names = None shm2_region_names = None - precreated_shm0_regions = self.create_advance(['op00', 'op01']) - precreated_shm1_regions = self.create_advance(['op10', 'op11']) - precreated_shm2_regions = self.create_advance(['op20', 'op21']) + precreated_shm0_regions = self.create_advance(["op00", "op01"]) + precreated_shm1_regions = self.create_advance(["op10", "op11"]) + precreated_shm2_regions = self.create_advance(["op20", "op21"]) for trial in _trials: try: - model_name = tu.get_model_name(trial, np.float32, np.float32, - np.float32) + model_name = tu.get_model_name( + trial, np.float32, np.float32, np.float32 + ) self.check_setup(model_name, [4, 6], 0) # Need scheduler to wait for queue to contain 3 request self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 3) threads = [] threads.append( @@ -1641,25 +1858,31 @@ def test_preferred_batch_only_use_no_preferred_size(self): target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm0_region_names, - 'precreated_shm_regions': precreated_shm0_regions - })) + "shm_region_names": shm0_region_names, + "precreated_shm_regions": precreated_shm0_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm1_region_names, - 'precreated_shm_regions': precreated_shm1_regions - })) + "shm_region_names": shm1_region_names, + "precreated_shm_regions": precreated_shm1_regions, + }, + ) + ) threads.append( threading.Thread( target=self.check_response, args=(trial, 1, (6000, None)), kwargs={ - 'shm_region_names': shm2_region_names, - 'precreated_shm_regions': precreated_shm2_regions - })) + "shm_region_names": shm2_region_names, + "precreated_shm_regions": precreated_shm2_regions, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1675,48 +1898,43 @@ def test_max_queue_delay_only_non_default(self): # there can be either 1 or 2 model executions. 
model_base = "custom" dtype = np.float32 - shapes = ([ - 1, - 1, - ],) + shapes = ( + [ + 1, + 1, + ], + ) try: # use threads to send 12 requests without waiting for response threads = [] for i in range(12): if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm_region_name_prefix = [ - "input" + str(i), "output" + str(i) - ] + shm_region_name_prefix = ["input" + str(i), "output" + str(i)] else: shm_region_name_prefix = None threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, model_base, 1, dtype, shapes, - shapes), - kwargs={ - 'use_grpc': - USE_GRPC, - 'use_http': - USE_HTTP, - 'use_http_json_tensors': - False, - 'use_streaming': - False, - 'shm_region_name_prefix': - shm_region_name_prefix, - 'use_system_shared_memory': - TEST_SYSTEM_SHARED_MEMORY, - 'use_cuda_shared_memory': - TEST_CUDA_SHARED_MEMORY - })) + threading.Thread( + target=iu.infer_zero, + args=(self, model_base, 1, dtype, shapes, shapes), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + "shm_region_name_prefix": shm_region_name_prefix, + "use_system_shared_memory": TEST_SYSTEM_SHARED_MEMORY, + "use_cuda_shared_memory": TEST_CUDA_SHARED_MEMORY, + }, + ) + ) for t in threads: t.start() for t in threads: t.join() self.check_deferred_exception() model_name = tu.get_zero_model_name(model_base, len(shapes), dtype) - self.check_status(model_name, None, 12, 12, (1,2)) + self.check_status(model_name, None, 12, 12, (1, 2)) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -1727,41 +1945,36 @@ def test_max_queue_delay_only_default(self): # and the remaining requests will form the second batch. model_base = "custom" dtype = np.float32 - shapes = ([ - 1, - 1, - ],) + shapes = ( + [ + 1, + 1, + ], + ) try: # use threads to send 12 requests without waiting for response threads = [] for i in range(12): if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: - shm_region_name_prefix = [ - "input" + str(i), "output" + str(i) - ] + shm_region_name_prefix = ["input" + str(i), "output" + str(i)] else: shm_region_name_prefix = None threads.append( - threading.Thread(target=iu.infer_zero, - args=(self, model_base, 1, dtype, shapes, - shapes), - kwargs={ - 'use_grpc': - USE_GRPC, - 'use_http': - USE_HTTP, - 'use_http_json_tensors': - False, - 'use_streaming': - False, - 'shm_region_name_prefix': - shm_region_name_prefix, - 'use_system_shared_memory': - TEST_SYSTEM_SHARED_MEMORY, - 'use_cuda_shared_memory': - TEST_CUDA_SHARED_MEMORY - })) + threading.Thread( + target=iu.infer_zero, + args=(self, model_base, 1, dtype, shapes, shapes), + kwargs={ + "use_grpc": USE_GRPC, + "use_http": USE_HTTP, + "use_http_json_tensors": False, + "use_streaming": False, + "shm_region_name_prefix": shm_region_name_prefix, + "use_system_shared_memory": TEST_SYSTEM_SHARED_MEMORY, + "use_cuda_shared_memory": TEST_CUDA_SHARED_MEMORY, + }, + ) + ) for t in threads: t.start() for t in threads: @@ -1772,5 +1985,6 @@ def test_max_queue_delay_only_default(self): except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': + +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_batcher/test.sh b/qa/L0_batcher/test.sh index d8ab6131f7..c5f8819276 100755 --- a/qa/L0_batcher/test.sh +++ b/qa/L0_batcher/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -74,7 +74,7 @@ if [ "$TEST_VALGRIND" -eq 1 ]; then test_multi_batch_different_shape_allow_ragged" fi -TF_VERSION=${TF_VERSION:=1} +TF_VERSION=${TF_VERSION:=2} # On windows the paths invoked by the script (running in WSL) must use # /mnt/c when needed but the paths on the tritonserver command-line @@ -91,6 +91,14 @@ else TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends + + # PyTorch on SBSA requires libgomp to be loaded first. See the following + # GitHub issue for more information: + # https://github.com/pytorch/pytorch/issues/2575 + arch=`uname -m` + if [ $arch = "aarch64" ]; then + SERVER_LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libgomp.so.1 + fi fi SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION}" @@ -99,7 +107,7 @@ source ../common/util.sh RET=0 # If BACKENDS not specified, set to all -BACKENDS=${BACKENDS:="graphdef savedmodel onnx libtorch plan"} +BACKENDS=${BACKENDS:="graphdef savedmodel onnx libtorch plan python"} export BACKENDS # Basic batcher tests @@ -138,11 +146,21 @@ MAX_QUEUE_DELAY_ONLY_TESTS=${MAX_QUEUE_DELAY_ONLY_TESTS:="test_max_queue_delay_o test_max_queue_delay_only_non_default"} # Setup non-variable-size model repository -rm -fr *.log *.serverlog models && mkdir models +rm -fr *.log models && mkdir models for BACKEND in $BACKENDS; do TMP_MODEL_DIR="$DATADIR/qa_model_repository/${BACKEND}_float32_float32_float32" - - cp -r $TMP_MODEL_DIR models/. && + if [ "$BACKEND" == "python" ]; then + # We will be using ONNX models config.pbtxt and tweak them to make them + # appropriate for Python backend + onnx_model="${DATADIR}/qa_model_repository/onnx_float32_float32_float32" + python_model=`echo $onnx_model | sed 's/onnx/python/g' | sed 's,'"$DATADIR/qa_model_repository/"',,g'` + mkdir -p models/$python_model/1/ + cat $onnx_model/config.pbtxt | sed 's/platform:.*/backend:\ "python"/g' | sed 's/onnx/python/g' > models/$python_model/config.pbtxt + cp $onnx_model/output0_labels.txt models/$python_model + cp ../python_models/add_sub/model.py models/$python_model/1/ + else + cp -r $TMP_MODEL_DIR models/. + fi (cd models/$(basename $TMP_MODEL_DIR) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 8/" config.pbtxt && \ sed -i "s/^version_policy:.*/version_policy: { specific { versions: [1] }}/" config.pbtxt && \ @@ -152,8 +170,18 @@ done rm -fr preferred_batch_only_models && mkdir preferred_batch_only_models for BACKEND in $BACKENDS; do TMP_MODEL_DIR="$DATADIR/qa_model_repository/${BACKEND}_float32_float32_float32" - - cp -r $TMP_MODEL_DIR preferred_batch_only_models/. && + if [ "$BACKEND" == "python" ]; then + # We will be using ONNX models config.pbtxt and tweak them to make them + # appropriate for Python backend + onnx_model="${DATADIR}/qa_model_repository/onnx_float32_float32_float32" + python_model=`echo $onnx_model | sed 's/onnx/python/g' | sed 's,'"$DATADIR/qa_model_repository/"',,g'` + mkdir -p preferred_batch_only_models/$python_model/1/ + cat $onnx_model/config.pbtxt | sed 's/platform:.*/backend:\ "python"/g' | sed 's/onnx/python/g' > preferred_batch_only_models/$python_model/config.pbtxt + cp $onnx_model/output0_labels.txt preferred_batch_only_models/$python_model + cp ../python_models/add_sub/model.py preferred_batch_only_models/$python_model/1/ + else + cp -r $TMP_MODEL_DIR preferred_batch_only_models/. 
+ fi (cd preferred_batch_only_models/$(basename $TMP_MODEL_DIR) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 8/" config.pbtxt && \ sed -i "s/^version_policy:.*/version_policy: { specific { versions: [1] }}/" config.pbtxt && \ @@ -164,14 +192,22 @@ done rm -fr var_models && mkdir var_models for BACKEND in $BACKENDS; do TMP_MODEL_DIR="$DATADIR/qa_variable_model_repository/${BACKEND}_float32_float32_float32" - - for TMP_DIR in $TMP_MODEL_DIR; do - cp -r $TMP_DIR var_models/. && - (cd var_models/$(basename $TMP_DIR) && \ + if [ "$BACKEND" == "python" ]; then + # We will be using ONNX models config.pbtxt and tweak them to make them + # appropriate for Python backend + onnx_model="${DATADIR}/qa_variable_model_repository/onnx_float32_float32_float32" + python_model=`echo $onnx_model | sed 's/onnx/python/g' | sed 's,'"$DATADIR/qa_variable_model_repository/"',,g'` + mkdir -p var_models/$python_model/1/ + cat $onnx_model/config.pbtxt | sed 's/platform:.*/backend:\ "python"/g' | sed 's/onnx/python/g' > var_models/$python_model/config.pbtxt + cp $onnx_model/output0_labels.txt var_models/$python_model + cp ../python_models/add_sub/model.py var_models/$python_model/1/ + else + cp -r $TMP_MODEL_DIR var_models/. + fi + (cd var_models/$(basename $TMP_MODEL_DIR) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 8/" config.pbtxt && \ sed -i "s/^version_policy:.*/version_policy: { specific { versions: [1] }}/" config.pbtxt && \ echo "dynamic_batching { preferred_batch_size: [ 2, 6 ], max_queue_delay_microseconds: 10000000 }" >> config.pbtxt) - done done for MC in `ls var_models/*/config.pbtxt`; do @@ -214,6 +250,19 @@ if [[ $BACKENDS == *"onnx"* ]]; then dynamic_batching { preferred_batch_size: [ 2, 6 ], max_queue_delay_microseconds: 10000000 }" >> config.pbtxt) fi +if [[ $BACKENDS == *"libtorch"* ]]; then + # Use nobatch model to match the ragged test requirement + cp -r $DATADIR/qa_identity_model_repository/libtorch_nobatch_zero_1_float32 var_models/libtorch_zero_1_float32 && \ + (cd var_models/libtorch_zero_1_float32 && \ + sed -i "s/nobatch_//" config.pbtxt && \ + sed -i "s/^max_batch_size:.*/max_batch_size: 8/" config.pbtxt && \ + sed -i "s/name: \"INPUT__0\"/name: \"INPUT__0\"\\nallow_ragged_batch: true/" config.pbtxt && \ + echo "batch_output [{target_name: \"OUTPUT__0\" \ + kind: BATCH_SCATTER_WITH_INPUT_SHAPE \ + source_input: \"INPUT__0\" }] \ + dynamic_batching { preferred_batch_size: [ 2, 6 ], max_queue_delay_microseconds: 10000000 }" >> config.pbtxt) +fi + # Need to launch the server for each test so that the model status is # reset (which is used to make sure the correctly batch size was used # for execution). 
Test everything with fixed-tensor-size models and @@ -224,7 +273,7 @@ for model_type in FIXED VARIABLE; do MODEL_PATH=models && [[ "$model_type" == "VARIABLE" ]] && MODEL_PATH=var_models for i in $NO_DELAY_TESTS ; do SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$model_type.serverlog" + SERVER_LOG="./$i.$model_type.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$model_type.valgrind.log" @@ -277,7 +326,7 @@ for model_type in FIXED VARIABLE; do [[ "$i" != "test_multi_batch_use_best_preferred" ]] && [[ "$i" != "test_multi_batch_delayed_use_max_batch" ]] && export TRITONSERVER_DELAY_SCHEDULER=2 SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$model_type.serverlog" + SERVER_LOG="./$i.$model_type.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$model_type.valgrind.log" @@ -327,7 +376,7 @@ done export BATCHER_TYPE=VARIABLE for i in $DIFFERENT_SHAPE_TESTS ; do SERVER_ARGS="--model-repository=$MODELDIR/var_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.VARIABLE.serverlog" + SERVER_LOG="./$i.VARIABLE.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.VARIABLE.valgrind.log" @@ -380,7 +429,7 @@ for i in \ test_multi_batch_delayed_preferred_different_shape ; do export TRITONSERVER_DELAY_SCHEDULER=4 SERVER_ARGS="--model-repository=$MODELDIR/var_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.VARIABLE.serverlog" + SERVER_LOG="./$i.VARIABLE.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.VARIABLE.valgrind.log" @@ -433,7 +482,7 @@ for i in $PREFERRED_BATCH_ONLY_TESTS ; do [[ "$i" != "test_preferred_batch_only_unaligned" ]] && export TRITONSERVER_DELAY_SCHEDULER=7 && [[ "$i" != "test_preferred_batch_only_use_biggest_preferred" ]] && export TRITONSERVER_DELAY_SCHEDULER=3 SERVER_ARGS="--model-repository=$MODELDIR/preferred_batch_only_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.PREFERRED_BATCH_ONLY.serverlog" + SERVER_LOG="./$i.PREFERRED_BATCH_ONLY.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.PREFERRED_BATCH_ONLY.valgrind.log" @@ -502,7 +551,7 @@ for i in $MAX_QUEUE_DELAY_ONLY_TESTS ; do sed -i "s/max_queue_delay_microseconds:.*\[.*\]/max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_MICROSECONDS}/g" config.pbtxt ) SERVER_ARGS="--model-repository=$MODELDIR/custom_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.MAX_QUEUE_DELAY_ONLY.serverlog" + SERVER_LOG="./$i.MAX_QUEUE_DELAY_ONLY.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.MAX_QUEUE_DELAY_ONLY.valgrind.log" @@ -580,7 +629,7 @@ if [[ "$(< /proc/sys/kernel/osrelease)" != *microsoft* ]]; then # not preserve SERVER_ARGS="--trace-file=not_preserve.log --trace-level=MIN --trace-rate=1 --model-repository=$MODELDIR/custom_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./not_preserve.serverlog" + SERVER_LOG="./not_preserve.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./not_preserve.valgrind.log" @@ -635,7 +684,7 @@ if [[ "$(< /proc/sys/kernel/osrelease)" != *microsoft* ]]; then sed -i "s/dynamic_batching.*/dynamic_batching { preferred_batch_size: [ 4 ] preserve_ordering: true }/g" config.pbtxt) SERVER_ARGS="--trace-file=preserve.log --trace-level=MIN --trace-rate=1 --model-repository=$MODELDIR/custom_models ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./preserve.serverlog" + SERVER_LOG="./preserve.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./preserve.valgrind.log" @@ -695,3 +744,4 @@ else fi exit $RET + diff --git 
a/qa/L0_batcher/verify_timestamps.py b/qa/L0_batcher/verify_timestamps.py old mode 100644 new mode 100755 index 30aad60fa3..3271135fcd --- a/qa/L0_batcher/verify_timestamps.py +++ b/qa/L0_batcher/verify_timestamps.py @@ -33,7 +33,7 @@ def verify_timestamps(traces, preserve): # Order traces by id - traces = sorted(traces, key=lambda t: t.get('id', -1)) + traces = sorted(traces, key=lambda t: t.get("id", -1)) # Filter the trace that is not meaningful and group them by 'id' filtered_traces = dict() @@ -41,7 +41,7 @@ def verify_timestamps(traces, preserve): for trace in traces: if "id" not in trace: continue - # Skip GRPC traces as actual traces are not genarated via GRPC, + # Skip GRPC traces as actual traces are not generated via GRPC, # thus GRPC traces are ill-formed if "timestamps" in trace: is_grpc = False @@ -53,16 +53,16 @@ def verify_timestamps(traces, preserve): grpc_id_offset += 1 continue - if (trace['id'] in filtered_traces.keys()): - rep_trace = filtered_traces[trace['id']] - # Apend the timestamp to the trace representing this 'id' + if trace["id"] in filtered_traces.keys(): + rep_trace = filtered_traces[trace["id"]] + # Append the timestamp to the trace representing this 'id' if "timestamps" in trace: rep_trace["timestamps"] += trace["timestamps"] else: # Use this trace to represent this 'id' if "timestamps" not in trace: trace["timestamps"] = [] - filtered_traces[trace['id']] = trace + filtered_traces[trace["id"]] = trace # First find the latest response complete timestamp for the batch with large delay large_delay_response_complete = 0 @@ -75,10 +75,11 @@ def verify_timestamps(traces, preserve): compute_span = timestamps["COMPUTE_END"] - timestamps["COMPUTE_START"] # If the 3rd batch is also processed by large delay instance, we don't # want to use its responses as baseline - if trace["id"] <= ( - 8 + grpc_id_offset) and compute_span >= 400 * 1000 * 1000: + if trace["id"] <= (8 + grpc_id_offset) and compute_span >= 400 * 1000 * 1000: response_complete = timestamps["INFER_RESPONSE_COMPLETE"] - large_delay_response_complete = max(large_delay_response_complete, response_complete) + large_delay_response_complete = max( + large_delay_response_complete, response_complete + ) else: small_delay_traces.append(trace) @@ -92,8 +93,11 @@ def verify_timestamps(traces, preserve): response_request_after_large_delay_count += 1 # Hardcoded expected count here - print("responses after large delay count: {}".format( - response_request_after_large_delay_count)) + print( + "responses after large delay count: {}".format( + response_request_after_large_delay_count + ) + ) if preserve: # If preserve ordering, there must be large delay batch followed by # small delay batch and thus at least 4 responses are sent after @@ -103,15 +107,18 @@ def verify_timestamps(traces, preserve): # before large delay batch regardless of the ordering in scheduler return 0 if response_request_after_large_delay_count == 0 else 1 -if __name__ == '__main__': + +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-p', - '--preserve', - action="store_true", - required=False, - default=False, - help='Timestamps is collected with preserve ordering') - parser.add_argument('file', type=argparse.FileType('r'), nargs='+') + parser.add_argument( + "-p", + "--preserve", + action="store_true", + required=False, + default=False, + help="Timestamps is collected with preserve ordering", + ) + parser.add_argument("file", type=argparse.FileType("r"), nargs="+") FLAGS = parser.parse_args() for f in 
FLAGS.file: diff --git a/qa/L0_buffer_attributes/buffer_attributes_test.py b/qa/L0_buffer_attributes/buffer_attributes_test.py old mode 100644 new mode 100755 index 907a469bab..7d61e082c5 --- a/qa/L0_buffer_attributes/buffer_attributes_test.py +++ b/qa/L0_buffer_attributes/buffer_attributes_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,28 +31,26 @@ sys.path.append("../common") import unittest + import numpy as np import test_util as tu - +import tritonclient.grpc as grpcclient +import tritonclient.http as httpclient import tritonclient.utils.cuda_shared_memory as cudashm from tritonclient.utils import triton_to_np_dtype -import tritonclient.http as httpclient -import tritonclient.grpc as grpcclient class BufferAttributesTest(tu.TestResultCollector): - def test_buffer_attributes(self): - model_name = 'bls' + model_name = "bls" # Infer clients = [ - httpclient.InferenceServerClient(url='localhost:8000'), - grpcclient.InferenceServerClient(url='localhost:8001') + httpclient.InferenceServerClient(url="localhost:8000"), + grpcclient.InferenceServerClient(url="localhost:8001"), ] triton_clients = [httpclient, grpcclient] for i, client in enumerate(clients): - # To make sure no shared memory regions are registered with the # server. client.unregister_system_shared_memory() @@ -59,8 +59,7 @@ def test_buffer_attributes(self): triton_client = triton_clients[i] inputs = [] outputs = [] - inputs.append(triton_client.InferInput('INPUT0', [1, 1000], - "INT32")) + inputs.append(triton_client.InferInput("INPUT0", [1, 1000], "INT32")) input0_data = np.arange(start=0, stop=1000, dtype=np.int32) input0_data = np.expand_dims(input0_data, axis=0) @@ -69,45 +68,55 @@ def test_buffer_attributes(self): output_byte_size = input_byte_size shm_ip0_handle = cudashm.create_shared_memory_region( - "input0_data", input_byte_size, 0) + "input0_data", input_byte_size, 0 + ) shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", output_byte_size, 0) + "output0_data", output_byte_size, 0 + ) client.register_cuda_shared_memory( - "input0_data", cudashm.get_raw_handle(shm_ip0_handle), 0, - input_byte_size) + "input0_data", + cudashm.get_raw_handle(shm_ip0_handle), + 0, + input_byte_size, + ) client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, - input_byte_size) + "output0_data", + cudashm.get_raw_handle(shm_op0_handle), + 0, + input_byte_size, + ) cudashm.set_shared_memory_region(shm_ip0_handle, [input0_data]) inputs[0].set_shared_memory("input0_data", input_byte_size) if triton_client is grpcclient: - outputs.append(triton_client.InferRequestedOutput('OUTPUT0')) + outputs.append(triton_client.InferRequestedOutput("OUTPUT0")) outputs[0].set_shared_memory("output0_data", output_byte_size) else: outputs.append( - triton_client.InferRequestedOutput('OUTPUT0', - binary_data=True)) + triton_client.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs[0].set_shared_memory("output0_data", output_byte_size) - results = client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) output0 = results.get_output("OUTPUT0") self.assertIsNotNone(output0) if triton_client is 
grpcclient: output0_data = cudashm.get_contents_as_numpy( - shm_op0_handle, triton_to_np_dtype(output0.datatype), - output0.shape) + shm_op0_handle, triton_to_np_dtype(output0.datatype), output0.shape + ) else: output0_data = cudashm.get_contents_as_numpy( - shm_op0_handle, triton_to_np_dtype(output0['datatype']), - output0['shape']) + shm_op0_handle, + triton_to_np_dtype(output0["datatype"]), + output0["shape"], + ) self.assertTrue(np.all(output0_data == input0_data)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_buffer_attributes/models/bls/1/model.py b/qa/L0_buffer_attributes/models/bls/1/model.py index c4b5151a1e..2d3e78e936 100644 --- a/qa/L0_buffer_attributes/models/bls/1/model.py +++ b/qa/L0_buffer_attributes/models/bls/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,23 +29,26 @@ # Simple Python model that executes a BLS request on an identity model. class TritonPythonModel: - def execute(self, requests): responses = [] for request in requests: # Get INPUT0 - input0 = pb_utils.get_input_tensor_by_name(request, 'INPUT0') + input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") infer_request = pb_utils.InferenceRequest( - model_name='identity', + model_name="identity", requested_output_names=["OUTPUT0"], - inputs=[input0]) + inputs=[input0], + ) infer_response = infer_request.exec() if infer_response.has_error(): - raise pb_utils.TritonModelException( - infer_response.error().message()) + raise pb_utils.TritonModelException(infer_response.error().message()) - inference_response = pb_utils.InferenceResponse(output_tensors=[pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0')]) + inference_response = pb_utils.InferenceResponse( + output_tensors=[ + pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + ] + ) responses.append(inference_response) return responses diff --git a/qa/L0_buffer_attributes/models/identity/1/model.py b/qa/L0_buffer_attributes/models/identity/1/model.py index 781360b147..2d4b592ae3 100644 --- a/qa/L0_buffer_attributes/models/identity/1/model.py +++ b/qa/L0_buffer_attributes/models/identity/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """ Identity model using DLPack in Python backend. @@ -36,6 +35,8 @@ def execute(self, requests): responses = [] for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") - out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", input_tensor.to_dlpack()) + out_tensor = pb_utils.Tensor.from_dlpack( + "OUTPUT0", input_tensor.to_dlpack() + ) responses.append(pb_utils.InferenceResponse([out_tensor])) return responses diff --git a/qa/L0_buffer_attributes/test.sh b/qa/L0_buffer_attributes/test.sh old mode 100644 new mode 100755 index 52babf37e2..7e2f35d837 --- a/qa/L0_buffer_attributes/test.sh +++ b/qa/L0_buffer_attributes/test.sh @@ -1,4 +1,5 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions diff --git a/qa/L0_client_build_variants/test.sh b/qa/L0_client_build_variants/test.sh index 9c36791144..ab3feb6172 100755 --- a/qa/L0_client_build_variants/test.sh +++ b/qa/L0_client_build_variants/test.sh @@ -31,15 +31,17 @@ apt-get install -y --no-install-recommends \ rapidjson-dev # Client build requires recent version of CMake (FetchContent required) -wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \ - gpg --dearmor - | \ - tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null && \ -apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' && \ -apt-get update && \ -apt-get install -y --no-install-recommends \ -cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1; \ +# Using CMAKE installation instruction from:: https://apt.kitware.com/ +apt update -q=2 \ + && apt install -y gpg wget \ + && wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null \ + && . /etc/os-release \ + && echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $UBUNTU_CODENAME main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null \ + && apt-get update -q=2 \ + && apt-get install -y --no-install-recommends cmake=3.27.7* cmake-data=3.27.7* cmake --version + set +e mkdir -p /workspace/build @@ -62,6 +64,9 @@ mkdir -p /workspace/build -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=OFF \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients java-clients python-clients) if [ $? -eq 0 ]; then @@ -90,6 +95,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then @@ -117,6 +125,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then @@ -143,6 +154,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then @@ -169,6 +183,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? 
-eq 0 ]; then @@ -195,6 +212,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then @@ -221,6 +241,9 @@ fi -DTRITON_ENABLE_EXAMPLES=ON \ -DTRITON_ENABLE_TESTS=ON \ -DTRITON_ENABLE_GPU=ON \ + -DTRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG} \ + -DTRITON_CORE_REPO_TAG=${TRITON_CORE_REPO_TAG} \ + -DTRITON_THIRD_PARTY_REPO_TAG=${TRITON_THIRD_PARTY_REPO_TAG} \ /workspace/client && \ make -j16 cc-clients python-clients) if [ $? -eq 0 ]; then diff --git a/qa/L0_client_java/test.sh b/qa/L0_client_java/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_client_memory_growth/client_memory_mail.py b/qa/L0_client_memory_growth/client_memory_mail.py old mode 100644 new mode 100755 index 53c20f6f9f..ef1703f2c3 --- a/qa/L0_client_memory_growth/client_memory_mail.py +++ b/qa/L0_client_memory_growth/client_memory_mail.py @@ -26,20 +26,25 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys -sys.path.append("../common") -import nightly_email_helper +sys.path.append("../common") import glob from datetime import date -if __name__ == '__main__': +import nightly_email_helper + +if __name__ == "__main__": today = date.today().strftime("%Y-%m-%d") subject = "Triton Client Memory Growth " + sys.argv[1] + " Summary: " + today memory_graphs = glob.glob("client_memory_growth*.log") write_up = "
This test is run for both HTTP and GRPC protocols using C++ and Python test scripts. The max-allowed difference between mean and maximum memory usage is set to 10MB and 1MB for C++ and Python tests individually."
     write_up += "• What to look for: A linear memory growth in the beginning of the graph is acceptable only when it is followed by a flat memory usage. If a linear memory growth is observed during the entire test then there is possibly a memory leak."
-    html_content = "" + write_up + ""
+    html_content = (
+        ''
+        + write_up
+        + ''
+    )
     for mem_graph in sorted(memory_graphs):
         html_content += "\n" + mem_graph + "\n"
         with open(mem_graph, "r") as f:
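For reference, the pass criterion described in the email text above (the gap between the mean and the maximum memory usage must stay within the per-language budget, 10MB for C++ and 1MB for Python) can be sketched roughly as follows. The function name, sample values, and threshold handling are illustrative assumptions, not the actual logic in check_valgrind_log.py or the nightly tooling.

# Rough sketch only: illustrates the mean-vs-max growth criterion described in
# the email write-up above. Sample values and names are hypothetical.
def memory_growth_ok(samples_mb, max_allowed_alloc_mb):
    """Return True if peak usage stays within the allowed band above the mean."""
    mean_mb = sum(samples_mb) / len(samples_mb)
    return max(samples_mb) - mean_mb <= max_allowed_alloc_mb


if __name__ == "__main__":
    # Growth that levels off: acceptable under a 10MB budget (MAX_ALLOWED_ALLOC="10").
    print(memory_growth_ok([100.0, 104.0, 105.0, 105.0, 105.0], 10.0))  # True
    # Growth that never flattens: flagged as a possible leak.
    print(memory_growth_ok([100.0 + i for i in range(50)], 10.0))  # False
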
diff --git a/qa/L0_client_memory_growth/models/custom_identity_int32/config.pbtxt b/qa/L0_client_memory_growth/models/custom_identity_int32/config.pbtxt
index 8d3a78baf4..6a2a76bde5 100644
--- a/qa/L0_client_memory_growth/models/custom_identity_int32/config.pbtxt
+++ b/qa/L0_client_memory_growth/models/custom_identity_int32/config.pbtxt
@@ -35,7 +35,7 @@ input [
     name: "INPUT0"
     data_type: TYPE_INT32
     dims: [ -1 ]
-    
+
   }
 ]
 output [
diff --git a/qa/L0_client_memory_growth/test.sh b/qa/L0_client_memory_growth/test.sh
index ecb0493b28..73188812b2 100755
--- a/qa/L0_client_memory_growth/test.sh
+++ b/qa/L0_client_memory_growth/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -59,13 +59,23 @@ source ../common/util.sh
 # Set the number of repetitions in nightly and weekly tests
 # Set the email subject for nightly and weekly tests
 if [ "$TRITON_PERF_WEEKLY" == 1 ]; then
-    # Run the test for each case approximately 1.5 hours
-    # All tests are run cumulatively for 7 hours
-    REPETITION_HTTP_CPP=1300000
-    REPETITION_HTTP_PY=2100000
-    REPETITION_GRPC_CPP=10000000
-    REPETITION_GRPC_PY=1500000
-    EMAIL_SUBJECT="Weekly"
+    if [ "$TRITON_PERF_LONG" == 1 ]; then
+        # ~ 12 hours
+        # GRPC cycles are reduced as there is high fluctuation in time spent
+        REPETITION_HTTP_CPP=2220000
+        REPETITION_HTTP_PY=3600000
+        REPETITION_GRPC_CPP=8000000
+        REPETITION_GRPC_PY=1500000
+        EMAIL_SUBJECT="Weekly Long"
+    else
+        # Run the test for each case approximately 1.5 hours
+        # All tests are run cumulatively for 7 hours
+        REPETITION_HTTP_CPP=1300000
+        REPETITION_HTTP_PY=2100000
+        REPETITION_GRPC_CPP=6600000
+        REPETITION_GRPC_PY=1000000
+        EMAIL_SUBJECT="Weekly"
+    fi
 else
     REPETITION_CPP=100000
     REPETITION_PY=10000
@@ -106,6 +116,13 @@ for PROTOCOL in http grpc; do
         if [ "$LANG" == "c++" ]; then
             MEMORY_GROWTH_TEST=$MEMORY_GROWTH_TEST_CPP
             MAX_ALLOWED_ALLOC="10"
+            # NOTE: This test has risk of exhausting all available sockets in
+            # the ephemeral port range. Re-using the same client connection
+            # ("-R") can easily solve this problem. However, to cleanly separate
+            # the resources used by different client objects, we create new
+            # connections for each request and retry/sleep on failure to give
+            # the system time to reclaim sockets after TIME_WAIT.
+            # TIP: You can use the "ss -s" command to observe the socket usage.
             EXTRA_ARGS="-r ${REPETITION_CPP} -i ${PROTOCOL}"
         else
             MEMORY_GROWTH_TEST="python $MEMORY_GROWTH_TEST_PY"
@@ -113,18 +130,21 @@ for PROTOCOL in http grpc; do
             EXTRA_ARGS="-r ${REPETITION_PY} -i ${PROTOCOL}"
         fi
 
+        set +e
         SECONDS=0
         $LEAKCHECK $LEAKCHECK_ARGS $MEMORY_GROWTH_TEST $EXTRA_ARGS >> ${CLIENT_LOG} 2>&1
+        TEST_RETCODE=$?
         TEST_DURATION=$SECONDS
-        if [ $? -ne 0 ]; then
+        set -e
+        if [ ${TEST_RETCODE} -ne 0 ]; then
             cat ${CLIENT_LOG}
             RET=1
             echo -e "\n***\n*** Test FAILED\n***"
         else
             python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG
             if [ $? -ne 0 ]; then
-            echo -e "\n***\n*** Memory leak detected\n***"
-            RET=1
+                echo -e "\n***\n*** Memory leak detected\n***"
+                RET=1
             fi
 
             set +e
@@ -159,8 +179,8 @@ else
 fi
 
 # Run only if both TRITON_FROM and TRITON_TO_DL are set
-if [[ ! -z "$TRITON_FROM" ]] || [[ ! -z "$TRITON_TO_DL" ]]; then
-    python client_memory_mail.py $EMAIL_SUBJECT
+if [[ ! -z "$TRITON_FROM" ]] && [[ ! -z "$TRITON_TO_DL" ]]; then
+    python client_memory_mail.py "$EMAIL_SUBJECT"
 fi
 
 exit $RET
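The comment added to L0_client_memory_growth/test.sh above explains why the C++ test creates a new client connection per request instead of reusing one ("-R"), and retries after a sleep so the kernel can reclaim sockets left in TIME_WAIT. A minimal Python sketch of that retry/sleep idea follows; the helper name, backoff values, and the use of the custom_identity_int32 model are assumptions for illustration, not code from this patch.

import time

import numpy as np
import tritonclient.http as httpclient


def infer_with_retry(model_name, inputs, retries=5, backoff_s=1.0):
    # Deliberately create a new client (and socket) per request; on failure,
    # sleep and retry to give the system time to reclaim TIME_WAIT sockets
    # (socket usage can be observed with "ss -s").
    for attempt in range(retries):
        try:
            client = httpclient.InferenceServerClient(url="localhost:8000")
            return client.infer(model_name, inputs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(backoff_s)


# Hypothetical usage against the custom_identity_int32 model used elsewhere in
# this patch; assumes a local Triton server listening on port 8000.
data = np.array([[10]], dtype=np.int32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "INT32")
infer_input.set_data_from_numpy(data)
result = infer_with_retry("custom_identity_int32", [infer_input])
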
diff --git a/qa/L0_client_nobatch/client_test.py b/qa/L0_client_nobatch/client_test.py
old mode 100644
new mode 100755
index b2f9467df1..c821d446d2
--- a/qa/L0_client_nobatch/client_test.py
+++ b/qa/L0_client_nobatch/client_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,20 +27,19 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from builtins import range
-from future.utils import iteritems
 import unittest
+
 import numpy as np
-import tritonhttpclient
+import test_util as tu
 import tritongrpcclient
+import tritonhttpclient
 from tritonclientutils import InferenceServerException
-import test_util as tu
 
 
 class ClientNoBatchTest(tu.TestResultCollector):
-
     def test_nobatch_request_for_batching_model(self):
         input_size = 16
 
@@ -47,53 +48,46 @@ def test_nobatch_request_for_batching_model(self):
         # input shapes.
         tensor_shape = (input_size,)
         for protocol in ["http", "grpc"]:
-            model_name = tu.get_model_name("graphdef", np.int32, np.int8,
-                                           np.int8)
-            in0 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
-            in1 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
+            model_name = tu.get_model_name("graphdef", np.int32, np.int8, np.int8)
+            in0 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
+            in1 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
 
             inputs = []
             outputs = []
             if protocol == "http":
                 triton_client = tritonhttpclient.InferenceServerClient(
-                    url='localhost:8000', verbose=True)
+                    url="localhost:8000", verbose=True
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritonhttpclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT1'))
+                    tritonhttpclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT1"))
             else:
                 triton_client = tritongrpcclient.InferenceServerClient(
-                    url='localhost:8001', verbose=True)
+                    url="localhost:8001", verbose=True
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritongrpcclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1'))
+                    tritongrpcclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1"))
 
             # Initialize the data
             inputs[0].set_data_from_numpy(in0)
             inputs[1].set_data_from_numpy(in1)
 
             try:
-                results = triton_client.infer(model_name,
-                                              inputs,
-                                              outputs=outputs)
+                _ = triton_client.infer(model_name, inputs, outputs=outputs)
                 self.assertTrue(
-                    False,
-                    "expected failure with no batch request for batching model")
+                    False, "expected failure with no batch request for batching model"
+                )
             except InferenceServerException as ex:
                 pass
 
@@ -105,53 +99,48 @@ def test_batch_request_for_nobatching_model(self):
         # is included in the shape
         tensor_shape = (1, input_size)
         for protocol in ["http", "grpc"]:
-            model_name = tu.get_model_name("graphdef_nobatch", np.int32,
-                                           np.int8, np.int8)
-            in0 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
-            in1 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
+            model_name = tu.get_model_name(
+                "graphdef_nobatch", np.int32, np.int8, np.int8
+            )
+            in0 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
+            in1 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
 
             inputs = []
             outputs = []
             if protocol == "http":
                 triton_client = tritonhttpclient.InferenceServerClient(
-                    url='localhost:8000', verbose=True)
+                    url="localhost:8000", verbose=True
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritonhttpclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT1'))
+                    tritonhttpclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT1"))
             else:
                 triton_client = tritongrpcclient.InferenceServerClient(
-                    url='localhost:8001', verbose=True)
+                    url="localhost:8001", verbose=True
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritongrpcclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1'))
+                    tritongrpcclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1"))
 
             # Initialize the data
             inputs[0].set_data_from_numpy(in0)
             inputs[1].set_data_from_numpy(in1)
 
             try:
-                results = triton_client.infer(model_name,
-                                              inputs,
-                                              outputs=outputs)
+                _ = triton_client.infer(model_name, inputs, outputs=outputs)
                 self.assertTrue(
                     False,
-                    "expected failure with batched request for non-batching model"
+                    "expected failure with batched request for non-batching model",
                 )
             except InferenceServerException as ex:
                 pass
@@ -164,41 +153,38 @@ def test_nobatch_request_for_nonbatching_model(self):
         # input shapes.
         tensor_shape = (input_size,)
         for protocol in ["http", "grpc"]:
-            model_name = tu.get_model_name("graphdef_nobatch", np.int32,
-                                           np.int8, np.int8)
-            in0 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
-            in1 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
+            model_name = tu.get_model_name(
+                "graphdef_nobatch", np.int32, np.int8, np.int8
+            )
+            in0 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
+            in1 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
 
             inputs = []
             outputs = []
             if protocol == "http":
                 triton_client = tritonhttpclient.InferenceServerClient(
-                    url='localhost:8000', verbose=True)
+                    url="localhost:8000", verbose=True
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritonhttpclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT1'))
+                    tritonhttpclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT1"))
             else:
                 triton_client = tritongrpcclient.InferenceServerClient(
-                    url='localhost:8001', verbose=True)
+                    url="localhost:8001", verbose=True
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritongrpcclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1'))
+                    tritongrpcclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1"))
 
             # Initialize the data
             inputs[0].set_data_from_numpy(in0)
@@ -214,41 +200,36 @@ def test_batch_request_for_batching_model(self):
         # is included in the shape
         tensor_shape = (1, input_size)
         for protocol in ["http", "grpc"]:
-            model_name = tu.get_model_name("graphdef", np.int32, np.int8,
-                                           np.int8)
-            in0 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
-            in1 = np.random.randint(low=0,
-                                    high=100,
-                                    size=tensor_shape,
-                                    dtype=np.int32)
+            model_name = tu.get_model_name("graphdef", np.int32, np.int8, np.int8)
+            in0 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
+            in1 = np.random.randint(low=0, high=100, size=tensor_shape, dtype=np.int32)
 
             inputs = []
             outputs = []
             if protocol == "http":
                 triton_client = tritonhttpclient.InferenceServerClient(
-                    url='localhost:8000', verbose=True)
+                    url="localhost:8000", verbose=True
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritonhttpclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritonhttpclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT1'))
+                    tritonhttpclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT1"))
             else:
                 triton_client = tritongrpcclient.InferenceServerClient(
-                    url='localhost:8001', verbose=True)
+                    url="localhost:8001", verbose=True
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT0', tensor_shape,
-                                                "INT32"))
+                    tritongrpcclient.InferInput("INPUT0", tensor_shape, "INT32")
+                )
                 inputs.append(
-                    tritongrpcclient.InferInput('INPUT1', tensor_shape,
-                                                "INT32"))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0'))
-                outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1'))
+                    tritongrpcclient.InferInput("INPUT1", tensor_shape, "INT32")
+                )
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0"))
+                outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1"))
 
             # Initialize the data
             inputs[0].set_data_from_numpy(in0)
@@ -257,5 +238,5 @@ def test_batch_request_for_batching_model(self):
             results = triton_client.infer(model_name, inputs, outputs=outputs)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_client_timeout/client_timeout_test.py b/qa/L0_client_timeout/client_infer_timeout_test.py
old mode 100644
new mode 100755
similarity index 61%
rename from qa/L0_client_timeout/client_timeout_test.py
rename to qa/L0_client_timeout/client_infer_timeout_test.py
index 4f4a59bcea..700e9bfe9b
--- a/qa/L0_client_timeout/client_timeout_test.py
+++ b/qa/L0_client_timeout/client_infer_timeout_test.py
@@ -1,5 +1,6 @@
-#!/bin/bash
-# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -26,24 +27,22 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from functools import partial
-import numpy as np
 import queue
-import unittest
-import os
-import time
 import socket
-import test_util as tu
+import unittest
+from functools import partial
 
-import tritongrpcclient as grpcclient
-import tritonhttpclient as httpclient
-from tritonclientutils import InferenceServerException
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+from tritonclient.utils import InferenceServerException
 
 
 class UserData:
-
     def __init__(self):
         self._completed_requests = queue.Queue()
 
@@ -55,55 +54,60 @@ def callback(user_data, result, error):
         user_data._completed_requests.put(result)
 
 
-class ClientTimeoutTest(tu.TestResultCollector):
-
+class ClientInferTimeoutTest(tu.TestResultCollector):
     def setUp(self):
         self.model_name_ = "custom_identity_int32"
         self.input0_data_ = np.array([[10]], dtype=np.int32)
+        self.input0_data_byte_size_ = 32
+        self.INFER_SMALL_INTERVAL = 2.0  # seconds; shorter than the model's 3-second delay, so requests time out
 
     def _prepare_request(self, protocol):
-        if (protocol == "grpc"):
+        if protocol == "grpc":
             self.inputs_ = []
-            self.inputs_.append(grpcclient.InferInput('INPUT0', [1, 1],
-                                                      "INT32"))
+            self.inputs_.append(grpcclient.InferInput("INPUT0", [1, 1], "INT32"))
             self.outputs_ = []
-            self.outputs_.append(grpcclient.InferRequestedOutput('OUTPUT0'))
+            self.outputs_.append(grpcclient.InferRequestedOutput("OUTPUT0"))
         else:
             self.inputs_ = []
-            self.inputs_.append(httpclient.InferInput('INPUT0', [1, 1],
-                                                      "INT32"))
+            self.inputs_.append(httpclient.InferInput("INPUT0", [1, 1], "INT32"))
             self.outputs_ = []
-            self.outputs_.append(httpclient.InferRequestedOutput('OUTPUT0'))
+            self.outputs_.append(httpclient.InferRequestedOutput("OUTPUT0"))
 
         self.inputs_[0].set_data_from_numpy(self.input0_data_)
 
     def test_grpc_infer(self):
-        triton_client = grpcclient.InferenceServerClient(url="localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
         self._prepare_request("grpc")
 
         # The model is configured to take three seconds to send the
         # response. Expect an exception for small timeout values.
         with self.assertRaises(InferenceServerException) as cm:
-            result = triton_client.infer(model_name=self.model_name_,
-                                         inputs=self.inputs_,
-                                         outputs=self.outputs_,
-                                         client_timeout=0.2)
+            _ = triton_client.infer(
+                model_name=self.model_name_,
+                inputs=self.inputs_,
+                outputs=self.outputs_,
+                client_timeout=0.2,
+            )
         self.assertIn("Deadline Exceeded", str(cm.exception))
 
         # Expect inference to pass successfully for a large timeout
         # value
-        result = triton_client.infer(model_name=self.model_name_,
-                                     inputs=self.inputs_,
-                                     outputs=self.outputs_,
-                                     client_timeout=10)
-
-        output0_data = result.as_numpy('OUTPUT0')
+        result = triton_client.infer(
+            model_name=self.model_name_,
+            inputs=self.inputs_,
+            outputs=self.outputs_,
+            client_timeout=10,
+        )
+
+        output0_data = result.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
     def test_grpc_async_infer(self):
-        triton_client = grpcclient.InferenceServerClient(url="localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
         self._prepare_request("grpc")
 
         user_data = UserData()
@@ -111,11 +115,13 @@ def test_grpc_async_infer(self):
         # The model is configured to take three seconds to send the
         # response. Expect an exception for small timeout values.
         with self.assertRaises(InferenceServerException) as cm:
-            triton_client.async_infer(model_name=self.model_name_,
-                                      inputs=self.inputs_,
-                                      callback=partial(callback, user_data),
-                                      outputs=self.outputs_,
-                                      client_timeout=2)
+            triton_client.async_infer(
+                model_name=self.model_name_,
+                inputs=self.inputs_,
+                callback=partial(callback, user_data),
+                outputs=self.outputs_,
+                client_timeout=self.INFER_SMALL_INTERVAL,
+            )
             data_item = user_data._completed_requests.get()
             if type(data_item) == InferenceServerException:
                 raise data_item
@@ -123,23 +129,25 @@ def test_grpc_async_infer(self):
 
         # Expect inference to pass successfully for a large timeout
         # value
-        triton_client.async_infer(model_name=self.model_name_,
-                                  inputs=self.inputs_,
-                                  callback=partial(callback, user_data),
-                                  outputs=self.outputs_,
-                                  client_timeout=10)
+        triton_client.async_infer(
+            model_name=self.model_name_,
+            inputs=self.inputs_,
+            callback=partial(callback, user_data),
+            outputs=self.outputs_,
+            client_timeout=10,
+        )
 
         # Wait until the results are available in user_data
         data_item = user_data._completed_requests.get()
         self.assertFalse(type(data_item) == InferenceServerException)
 
-        output0_data = data_item.as_numpy('OUTPUT0')
+        output0_data = data_item.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
     def test_grpc_stream_infer(self):
-
-        triton_client = grpcclient.InferenceServerClient(url="localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
 
         self._prepare_request("grpc")
         user_data = UserData()
@@ -148,11 +156,12 @@ def test_grpc_stream_infer(self):
         # response. Expect an exception for small timeout values.
         with self.assertRaises(InferenceServerException) as cm:
             triton_client.stop_stream()
-            triton_client.start_stream(callback=partial(callback, user_data),
-                                       stream_timeout=1)
-            triton_client.async_stream_infer(model_name=self.model_name_,
-                                             inputs=self.inputs_,
-                                             outputs=self.outputs_)
+            triton_client.start_stream(
+                callback=partial(callback, user_data), stream_timeout=1
+            )
+            triton_client.async_stream_infer(
+                model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+            )
             data_item = user_data._completed_requests.get()
             if type(data_item) == InferenceServerException:
                 raise data_item
@@ -161,73 +170,79 @@ def test_grpc_stream_infer(self):
         # Expect inference to pass successfully for a large timeout
         # value
         triton_client.stop_stream()
-        triton_client.start_stream(callback=partial(callback, user_data),
-                                   stream_timeout=100)
+        triton_client.start_stream(
+            callback=partial(callback, user_data), stream_timeout=100
+        )
 
-        triton_client.async_stream_infer(model_name=self.model_name_,
-                                         inputs=self.inputs_,
-                                         outputs=self.outputs_)
+        triton_client.async_stream_infer(
+            model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+        )
         data_item = user_data._completed_requests.get()
         triton_client.stop_stream()
 
         if type(data_item) == InferenceServerException:
             raise data_item
-        output0_data = data_item.as_numpy('OUTPUT0')
+        output0_data = data_item.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
     def test_http_infer(self):
-
         self._prepare_request("http")
 
         # The model is configured to take three seconds to send the
         # response. Expect an exception for small timeout values.
         with self.assertRaises(socket.timeout) as cm:
             triton_client = httpclient.InferenceServerClient(
-                url="localhost:8000", verbose=True, network_timeout=2.0)
-            result = triton_client.infer(model_name=self.model_name_,
-                                         inputs=self.inputs_,
-                                         outputs=self.outputs_)
+                url="localhost:8000",
+                verbose=True,
+                network_timeout=self.INFER_SMALL_INTERVAL,
+            )
+            _ = triton_client.infer(
+                model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+            )
         self.assertIn("timed out", str(cm.exception))
 
         # Expect to successfully pass with sufficiently large timeout
         triton_client = httpclient.InferenceServerClient(
-            url="localhost:8000", verbose=True, connection_timeout=10.0)
+            url="localhost:8000", verbose=True, connection_timeout=10.0
+        )
 
-        result = triton_client.infer(model_name=self.model_name_,
-                                     inputs=self.inputs_,
-                                     outputs=self.outputs_)
+        result = triton_client.infer(
+            model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+        )
 
-        output0_data = result.as_numpy('OUTPUT0')
+        output0_data = result.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
     def test_http_async_infer(self):
-
         self._prepare_request("http")
 
         # The model is configured to take three seconds to send the
         # response. Expect an exception for small timeout values.
         with self.assertRaises(socket.timeout) as cm:
             triton_client = httpclient.InferenceServerClient(
-                url="localhost:8000", verbose=True, network_timeout=2.0)
+                url="localhost:8000",
+                verbose=True,
+                network_timeout=self.INFER_SMALL_INTERVAL,
+            )
             async_request = triton_client.async_infer(
-                model_name=self.model_name_,
-                inputs=self.inputs_,
-                outputs=self.outputs_)
+                model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+            )
             result = async_request.get_result()
         self.assertIn("timed out", str(cm.exception))
 
         # Expect to successfully pass with sufficiently large timeout
         triton_client = httpclient.InferenceServerClient(
-            url="localhost:8000", verbose=True, connection_timeout=10.0)
+            url="localhost:8000", verbose=True, connection_timeout=10.0
+        )
 
-        async_request = triton_client.async_infer(model_name=self.model_name_,
-                                                  inputs=self.inputs_,
-                                                  outputs=self.outputs_)
+        async_request = triton_client.async_infer(
+            model_name=self.model_name_, inputs=self.inputs_, outputs=self.outputs_
+        )
         result = async_request.get_result()
 
-        output0_data = result.as_numpy('OUTPUT0')
+        output0_data = result.as_numpy("OUTPUT0")
         self.assertTrue(np.array_equal(self.input0_data_, output0_data))
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_client_timeout/client_non_infer_timeout_test.py b/qa/L0_client_timeout/client_non_infer_timeout_test.py
new file mode 100755
index 0000000000..bbaf8c34e8
--- /dev/null
+++ b/qa/L0_client_timeout/client_non_infer_timeout_test.py
@@ -0,0 +1,340 @@
+#!/usr/bin/env python3
+
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import unittest
+
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+from tritonclient.utils import InferenceServerException
+
+
+class ClientNonInferTimeoutTest(tu.TestResultCollector):
+    def setUp(self):
+        self.model_name_ = "custom_identity_int32"
+        self.input0_data_ = np.array([[10]], dtype=np.int32)
+        self.input0_data_byte_size_ = 32
+        self.SMALL_INTERVAL = 0.1  # seconds; short enough to always time out
+        self.NORMAL_INTERVAL = 5.0  # seconds; long enough for the server to load the model and respond
+
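+    # Each test below calls one non-infer gRPC API twice: first with
+    # SMALL_INTERVAL, which is expected to exceed the server-side response
+    # delay and raise "Deadline Exceeded", then with NORMAL_INTERVAL, which
+    # is expected to succeed.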
+    def test_grpc_server_live(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.is_server_live(client_timeout=self.SMALL_INTERVAL)
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        self.assertTrue(
+            triton_client.is_server_live(client_timeout=self.NORMAL_INTERVAL)
+        )
+
+    def test_grpc_is_server_ready(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.is_server_ready(client_timeout=self.SMALL_INTERVAL)
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        self.assertTrue(
+            triton_client.is_server_ready(client_timeout=self.NORMAL_INTERVAL)
+        )
+
+    def test_grpc_is_model_ready(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.is_model_ready(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        self.assertTrue(
+            triton_client.is_model_ready(
+                model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+            )
+        )
+
+    def test_grpc_get_server_metadata(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_server_metadata(client_timeout=self.SMALL_INTERVAL)
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+
+        triton_client.get_server_metadata(client_timeout=self.NORMAL_INTERVAL)
+
+    def test_grpc_get_model_metadata(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_model_metadata(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_model_metadata(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_model_config(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_model_config(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_model_config(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_model_repository_index(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_model_repository_index(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_model_repository_index(client_timeout=self.NORMAL_INTERVAL)
+
+    def test_grpc_load_model(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        triton_client.unload_model(model_name=self.model_name_)
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.load_model(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unload_model(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+        triton_client.load_model(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_unload_model(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.unload_model(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.load_model(model_name=self.model_name_)
+        triton_client.unload_model(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+        triton_client.load_model(model_name=self.model_name_)
+
+    def test_grpc_get_inference_statistics(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_inference_statistics(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_inference_statistics(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_update_trace_settings(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.update_trace_settings(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.update_trace_settings(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_trace_settings(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_trace_settings(
+                model_name=self.model_name_, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_trace_settings(
+            model_name=self.model_name_, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_update_log_settings(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        settings = {}
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.update_log_settings(
+                settings=settings, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.update_log_settings(
+            settings=settings, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_log_settings(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_log_settings(
+                as_json=True, client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_log_settings(
+            as_json=True, client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_system_shared_memory_status(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_system_shared_memory_status(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_system_shared_memory_status(
+            client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_register_system_shared_memory(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        triton_client.unregister_system_shared_memory()
+        import tritonclient.utils.shared_memory as shm
+
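+        # Create and populate a system shared memory region locally so the
+        # register calls below have a real region to register.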
+        shm_ip0_handle = shm.create_shared_memory_region(
+            "input0_data", "/input_simple", self.input0_data_byte_size_
+        )
+        shm.set_shared_memory_region(shm_ip0_handle, [self.input0_data_])
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.register_system_shared_memory(
+                "input0_data",
+                "/input_simple",
+                self.input0_data_byte_size_,
+                client_timeout=self.SMALL_INTERVAL,
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unregister_system_shared_memory()
+        triton_client.register_system_shared_memory(
+            "input0_data",
+            "/input_simple",
+            self.input0_data_byte_size_,
+            client_timeout=self.NORMAL_INTERVAL,
+        )
+        triton_client.unregister_system_shared_memory()
+
+    def test_grpc_unregister_system_shared_memory(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.unregister_system_shared_memory(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unregister_system_shared_memory(
+            client_timeout=self.NORMAL_INTERVAL
+        )
+
+    def test_grpc_get_cuda_shared_memory_status(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.get_cuda_shared_memory_status(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.get_cuda_shared_memory_status(client_timeout=self.NORMAL_INTERVAL)
+
+    def test_grpc_register_cuda_shared_memory(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        import tritonclient.utils.cuda_shared_memory as cshm
+
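+        # Create and populate a CUDA shared memory region locally; the first
+        # register attempt below uses SMALL_INTERVAL and is expected to time
+        # out.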
+        input_data = np.array([[10]], dtype=np.int32)
+        byteSize = input_data.itemsize * input_data.size
+        shm_op0_handle = cshm.create_shared_memory_region(
+            "dummy_data", byte_size=byteSize, device_id=0
+        )
+        cshm.set_shared_memory_region(shm_op0_handle, [input_data])
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.register_cuda_shared_memory(
+                "dummy_data",
+                cshm.get_raw_handle(shm_op0_handle),
+                device_id=0,
+                byte_size=byteSize,
+                client_timeout=self.SMALL_INTERVAL,
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unregister_cuda_shared_memory()
+        triton_client.register_cuda_shared_memory(
+            "dummy_data",
+            cshm.get_raw_handle(shm_op0_handle),
+            device_id=0,
+            byte_size=byteSize,
+            client_timeout=self.NORMAL_INTERVAL,
+        )
+        cshm.destroy_shared_memory_region(shm_op0_handle)
+
+    def test_grpc_unregister_cuda_shared_memory(self):
+        triton_client = grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        )
+        with self.assertRaises(InferenceServerException) as cm:
+            _ = triton_client.unregister_cuda_shared_memory(
+                client_timeout=self.SMALL_INTERVAL
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+        triton_client.unregister_cuda_shared_memory(client_timeout=self.NORMAL_INTERVAL)
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_client_timeout/models/custom_identity_int32/config.pbtxt b/qa/L0_client_timeout/models/custom_identity_int32/config.pbtxt
index a42c5dcd45..1732ff32fd 100644
--- a/qa/L0_client_timeout/models/custom_identity_int32/config.pbtxt
+++ b/qa/L0_client_timeout/models/custom_identity_int32/config.pbtxt
@@ -35,7 +35,7 @@ input [
     name: "INPUT0"
     data_type: TYPE_INT32
     dims: [ -1 ]
-    
+
   }
 ]
 output [
diff --git a/qa/L0_client_timeout/test.sh b/qa/L0_client_timeout/test.sh
old mode 100644
new mode 100755
index a832694b84..f250dc9fa3
--- a/qa/L0_client_timeout/test.sh
+++ b/qa/L0_client_timeout/test.sh
@@ -39,10 +39,12 @@ if [ ! -z "$TEST_REPO_ARCH" ]; then
 fi
 
 export CUDA_VISIBLE_DEVICES=0
-
+TIMEOUT_VALUE=100000000
+SHORT_TIMEOUT_VALUE=1000
 RET=0
 
-CLIENT_TIMEOUT_TEST=client_timeout_test.py
+CLIENT_INFER_TIMEOUT_TEST=client_infer_timeout_test.py
+CLIENT_NON_INFER_TIMEOUT_TEST=client_non_infer_timeout_test.py
 CLIENT_TIMEOUT_TEST_CPP=../clients/client_timeout_test
 TEST_RESULT_FILE='test_results.txt'
 
@@ -50,27 +52,62 @@ rm -f *.log
 rm -f *.log.*
 
 CLIENT_LOG=`pwd`/client.log
+CLIENT_GRPC_TIMEOUTS_LOG=`pwd`/client.log.grpc
 DATADIR=`pwd`/models
 SERVER=/opt/tritonserver/bin/tritonserver
-SERVER_ARGS="--model-repository=$DATADIR"
+SERVER_ARGS="--model-repository=$DATADIR --model-control-mode=explicit --load-model=custom_identity_int32 --log-verbose 2"
 source ../common/util.sh
 
 mkdir -p $DATADIR/custom_identity_int32/1
 
+# Test all APIs apart from Infer.
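+# The variable exported below appears to be a server-side test hook that
+# delays every gRPC response by the given number of seconds, so even
+# non-infer APIs can exceed a short client timeout.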
+export TRITONSERVER_SERVER_DELAY_GRPC_RESPONSE_SEC=2
 run_server
+if [ $? -eq 1 ]; then
+    echo -e "\n***\n*** Test Failed: GRPC non-infer APIs\n***"
+    RET=1
+fi
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
     cat $SERVER_LOG
     exit 1
 fi
 
+set +e
+# Expect timeout for everything
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -i grpc -p >> ${CLIENT_LOG}.c++.grpc_non_infer_apis 2>&1
+if [ `grep -c "Deadline Exceeded" ${CLIENT_LOG}.c++.grpc_non_infer_apis` != "18" ]; then
+    cat ${CLIENT_LOG}.c++.grpc_non_infer_apis
+    echo -e "\n***\n*** Test Failed. Expected 18 failed\n***"
+    RET=1
+fi
+# Test all APIs with long timeout
+$CLIENT_TIMEOUT_TEST_CPP -t $TIMEOUT_VALUE -v -i grpc -p >> ${CLIENT_LOG} 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test Failed: GRPC non-infer APIs\n***"
+    RET=1
+fi
+
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test infer APIs
+unset TRITONSERVER_SERVER_DELAY_GRPC_RESPONSE_SEC
+SERVER_ARGS="--model-repository=$DATADIR --log-verbose 2"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
 set +e
 
 # CASE 1: Provide too small a timeout and expect a failure.
 # Note, the custom_identity_int32 is configured with a delay
 # of 3 sec.
 # Test request timeout in grpc synchronous inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v -i grpc >> ${CLIENT_LOG}.c++.grpc_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -i grpc >> ${CLIENT_LOG}.c++.grpc_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -81,7 +118,7 @@ if [ `grep -c "Deadline Exceeded" ${CLIENT_LOG}.c++.grpc_infer` != "1" ]; then
 fi
 
 # Test request timeout in grpc asynchronous inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v -i grpc -a >> ${CLIENT_LOG}.c++.grpc_async_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -i grpc -a >> ${CLIENT_LOG}.c++.grpc_async_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -92,7 +129,7 @@ if [ `grep -c "Deadline Exceeded" ${CLIENT_LOG}.c++.grpc_async_infer` != "1" ];
 fi
 
 # Test stream timeout in grpc asynchronous streaming inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v -i grpc -s >> ${CLIENT_LOG}.c++.grpc_async_stream_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -i grpc -s >> ${CLIENT_LOG}.c++.grpc_async_stream_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -103,7 +140,7 @@ if [ `grep -c "Stream has been closed" ${CLIENT_LOG}.c++.grpc_async_stream_infer
 fi
 
 # Test request timeout in http synchronous inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v >> ${CLIENT_LOG}.c++.http_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v >> ${CLIENT_LOG}.c++.http_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -115,7 +152,7 @@ fi
 
 
 # Test request timeout in http asynchronous inference
-$CLIENT_TIMEOUT_TEST_CPP -t 1000 -v -a >> ${CLIENT_LOG}.c++.http_async_infer 2>&1
+$CLIENT_TIMEOUT_TEST_CPP -t $SHORT_TIMEOUT_VALUE -v -a >> ${CLIENT_LOG}.c++.http_async_infer 2>&1
 if [ $? -eq 0 ]; then
     RET=1
 fi
@@ -136,7 +173,6 @@ fi
 
 
 # CASE 2: Provide sufficiently large timeout value
-TIMEOUT_VALUE=100000000
 set +e
 
 echo "TEST:  GRPC Synchronous" >> ${CLIENT_LOG}
@@ -174,7 +210,6 @@ if [ $? -ne 0 ]; then
     RET=1
 fi
 
-
 echo "TEST:  Python Library" >> ${CLIENT_LOG}
 
 # CASE 3: Python Library
@@ -185,7 +220,7 @@ for i in test_grpc_infer \
     test_http_infer \
     test_http_async_infer \
    ; do
-    python $CLIENT_TIMEOUT_TEST ClientTimeoutTest.$i >>$CLIENT_LOG 2>&1
+    python $CLIENT_INFER_TIMEOUT_TEST ClientInferTimeoutTest.$i >>$CLIENT_LOG 2>&1
     if [ $? -ne 0 ]; then
         echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
             echo -e "\n***\n*** Test $i Failed\n***"
@@ -204,6 +239,28 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+# Test all APIs other than infer
+export TRITONSERVER_SERVER_DELAY_GRPC_RESPONSE_SEC=2
+SERVER_ARGS="${SERVER_ARGS} --model-control-mode=explicit --load-model=custom_identity_int32 --log-verbose 2"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+set +e
+
+python $CLIENT_NON_INFER_TIMEOUT_TEST >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
+    echo -e "\n***\n*** Test $i Failed\n***"
+    RET=1
+fi
+
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
 if [ $RET -eq 0 ]; then
     echo -e "\n***\n*** Test Passed\n***"
 else
@@ -211,4 +268,5 @@ else
     echo -e "\n***\n*** Test FAILED\n***"
 fi
 
+set +e
 exit $RET
diff --git a/qa/L0_client_valgrind/models/custom_identity_int32/config.pbtxt b/qa/L0_client_valgrind/models/custom_identity_int32/config.pbtxt
index 8d3a78baf4..6a2a76bde5 100644
--- a/qa/L0_client_valgrind/models/custom_identity_int32/config.pbtxt
+++ b/qa/L0_client_valgrind/models/custom_identity_int32/config.pbtxt
@@ -35,7 +35,7 @@ input [
     name: "INPUT0"
     data_type: TYPE_INT32
     dims: [ -1 ]
-    
+
   }
 ]
 output [
diff --git a/qa/L0_client_valgrind/test.sh b/qa/L0_client_valgrind/test.sh
index 062417753c..0870aa883c 100755
--- a/qa/L0_client_valgrind/test.sh
+++ b/qa/L0_client_valgrind/test.sh
@@ -87,8 +87,8 @@ for PROTOCOL in http grpc; do
         else
             python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG
             if [ $? -ne 0 ]; then
-            echo -e "\n***\n*** Memory leak detected\n***"
-            RET=1
+                echo -e "\n***\n*** Memory leak detected\n***"
+                RET=1
             fi
         fi
     done
diff --git a/qa/L0_cmdline_trace/test.sh b/qa/L0_cmdline_trace/test.sh
index efe20ac386..66f9a08fc0 100755
--- a/qa/L0_cmdline_trace/test.sh
+++ b/qa/L0_cmdline_trace/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,9 +25,8 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-SIMPLE_HTTP_CLIENT=../clients/simple_http_infer_client
-SIMPLE_GRPC_CLIENT=../clients/simple_grpc_infer_client
 TRACE_SUMMARY=../common/trace_summary.py
+CLIENT_SCRIPT=trace_client.py
 
 REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
 if [ "$#" -ge 1 ]; then
@@ -79,12 +78,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_off.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_off.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_off.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_off.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -117,12 +116,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_min.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_min.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_min.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_min.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -165,12 +164,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_max.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_max.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_max.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_max.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -212,12 +211,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_1.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_1.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_1.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_1.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -260,12 +259,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_6.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_6.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_6.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_6.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -309,12 +308,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_frequency.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_frequency.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_frequency.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_frequency.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -370,12 +369,12 @@ fi
 set +e
 
 for p in {1..10}; do
-    $SIMPLE_HTTP_CLIENT >> client_9.log 2>&1
+    python3 $CLIENT_SCRIPT -i grpc -u localhost:8001 >> client_9.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
 
-    $SIMPLE_GRPC_CLIENT >> client_9.log 2>&1
+    python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_9.log 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
@@ -439,7 +438,7 @@ fi
 
 set +e
 
-$SIMPLE_HTTP_CLIENT >> client_ensemble.log 2>&1
+python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_ensemble.log 2>&1
 if [ $? -ne 0 ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
@@ -461,15 +460,15 @@ if [ `grep -c "COMPUTE_INPUT_END" summary_ensemble.log` != "7" ]; then
 fi
 
 for trace_str in \
-        "{\"id\":1,\"model_name\":\"simple\",\"model_version\":1}" \
-        "{\"id\":2,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":3,\"model_name\":\"fan_${MODELBASE}\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":4,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":5,\"model_name\":\"${MODELBASE}\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":6,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":7,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":8,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":9,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" ; do
+        "{\"id\":1,\"model_name\":\"simple\",\"model_version\":1,\"request_id\":\"1\"}" \
+        "{\"id\":2,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":3,\"model_name\":\"fan_${MODELBASE}\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":4,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":5,\"model_name\":\"${MODELBASE}\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":6,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":7,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":8,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":9,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" ; do
     if [ `grep -c ${trace_str} trace_ensemble.log` != "1" ]; then
         echo -e "Ensemble trace log expects trace: ${trace_str}"
         RET=1
@@ -485,12 +484,6 @@ fi
 set -e
 
 
-if [ $RET -eq 0 ]; then
-    echo -e "\n***\n*** Test Passed\n***"
-else
-    echo -e "\n***\n*** Test FAILED\n***"
-fi
-
 # trace-rate == 1, trace-level=TIMESTAMPS, trace-level=TENSORS
 SERVER_ARGS="--http-thread-count=1 --trace-file=trace_ensemble_tensor.log \
              --trace-level=TIMESTAMPS --trace-level=TENSORS --trace-rate=1 --model-repository=$MODELSDIR"
@@ -504,7 +497,7 @@ fi
 
 set +e
 
-$SIMPLE_HTTP_CLIENT >> client_ensemble_tensor.log 2>&1
+python3 $CLIENT_SCRIPT -i http -u localhost:8000 >> client_ensemble_tensor.log 2>&1
 if [ $? -ne 0 ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
@@ -525,15 +518,15 @@ if [ `grep -c "COMPUTE_INPUT_END" summary_ensemble_tensor.log` != "7" ]; then
     RET=1
 fi
 for trace_str in \
-        "{\"id\":1,\"model_name\":\"simple\",\"model_version\":1}" \
-        "{\"id\":2,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":3,\"model_name\":\"fan_${MODELBASE}\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":4,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":5,\"model_name\":\"${MODELBASE}\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":6,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":7,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":3}" \
-        "{\"id\":8,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" \
-        "{\"id\":9,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"parent_id\":1}" ; do
+        "{\"id\":1,\"model_name\":\"simple\",\"model_version\":1,\"request_id\":\"1\"}" \
+        "{\"id\":2,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":3,\"model_name\":\"fan_${MODELBASE}\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":4,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":5,\"model_name\":\"${MODELBASE}\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":6,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":7,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":3}" \
+        "{\"id\":8,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" \
+        "{\"id\":9,\"model_name\":\"nop_TYPE_INT32_-1\",\"model_version\":1,\"request_id\":\"1\",\"parent_id\":1}" ; do
     if [ `grep -c ${trace_str} trace_ensemble_tensor.log` != "1" ]; then
         echo -e "Ensemble trace tensors log expects trace: ${trace_str}"
         RET=1
@@ -577,4 +570,59 @@ else
 fi
 
 
+# check deprecation warnings
+SERVER_ARGS=" --trace-file=/tmp/trace.json --trace-rate=100 --trace-level=TIMESTAMPS \
+              --trace-log-frequency=50 --trace-count=100 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_trace_config_flag.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+if [ `grep -c "Warning: '--trace-file' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+if [ `grep -c "Warning: '--trace-rate' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+if [ `grep -c "Warning: '--trace-level' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+if [ `grep -c "Warning: '--trace-log-frequency' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+if [ `grep -c "Warning: '--trace-count' has been deprecated" $SERVER_LOG` != "1" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+
 exit $RET
diff --git a/qa/L0_cmdline_trace/trace_client.py b/qa/L0_cmdline_trace/trace_client.py
new file mode 100755
index 0000000000..4d59579d7c
--- /dev/null
+++ b/qa/L0_cmdline_trace/trace_client.py
@@ -0,0 +1,79 @@
+#!/usr/bin/env python
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import argparse
+import sys
+
+import numpy as np
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-u",
+        "--url",
+        type=str,
+        required=False,
+        default="localhost:8001",
+        help="Inference server URL. Default is localhost:8001.",
+    )
+    parser.add_argument("-i", "--protocol", type=str, required=True)
+    FLAGS = parser.parse_args()
+
+    if FLAGS.protocol == "grpc":
+        client_type = grpcclient
+    else:
+        client_type = httpclient
+
+    try:
+        triton_client = client_type.InferenceServerClient(url=FLAGS.url)
+    except Exception as e:
+        print("channel creation failed: " + str(e))
+        sys.exit(1)
+
+    model_name = "simple"
+
+    # Infer
+    inputs = []
+    outputs = []
+    inputs.append(client_type.InferInput("INPUT0", [1, 16], "INT32"))
+    inputs.append(client_type.InferInput("INPUT1", [1, 16], "INT32"))
+
+    input0_data = np.arange(start=0, stop=16, dtype=np.int32)
+    input0_data = np.expand_dims(input0_data, axis=0)
+    input1_data = np.ones(shape=(1, 16), dtype=np.int32)
+
+    inputs[0].set_data_from_numpy(input0_data)
+    inputs[1].set_data_from_numpy(input1_data)
+
+    outputs.append(client_type.InferRequestedOutput("OUTPUT0"))
+    outputs.append(client_type.InferRequestedOutput("OUTPUT1"))
+
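+    # request_id is set explicitly because test.sh checks that it appears in
+    # the generated trace entries.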
+    triton_client.infer(
+        model_name=model_name, inputs=inputs, outputs=outputs, request_id="1"
+    )
diff --git a/qa/L0_config_json/max_priority_level.pbtxt b/qa/L0_config_json/max_priority_level.pbtxt
new file mode 100644
index 0000000000..f71f08d236
--- /dev/null
+++ b/qa/L0_config_json/max_priority_level.pbtxt
@@ -0,0 +1,62 @@
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+name: "max_priority_level"
+backend: "identity"
+max_batch_size: 1
+input [
+  {
+    name: "INPUT0"
+    data_type: TYPE_FP32
+    dims: [ 1 ]
+  }
+]
+output [
+  {
+    name: "OUTPUT0"
+    data_type: TYPE_FP32
+    dims: [ 1 ]
+  }
+]
+
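+# The values below sit at numeric boundaries (uint64 max, uint32 max, and
+# uint32 max + 1) so test.sh can check that they survive the model-config
+# JSON round trip unchanged.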
+dynamic_batching:
+{
+    # Max uint64
+    priority_levels: 18446744073709551615
+    # Max uint32
+    default_priority_level: 4294967295
+    # Max uint32 + 1
+    priority_queue_policy: [
+        {
+            key: 4294967296
+            value: {
+                timeout_action: REJECT
+                default_timeout_microseconds: 18446744073709551615
+                allow_timeout_override: true
+                max_queue_size: 10
+            }
+        }
+    ]
+}
\ No newline at end of file
diff --git a/qa/L0_config_json/test.sh b/qa/L0_config_json/test.sh
index 0b7f29e05b..b1016b806b 100755
--- a/qa/L0_config_json/test.sh
+++ b/qa/L0_config_json/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -373,6 +373,52 @@ fi
 kill $SERVER_PID
 wait $SERVER_PID
 
+# Test max_priority_level
+TRIAL=max_priority_level
+
+rm -fr models && mkdir models
+mkdir -p models/max_priority_level/1 && cp max_priority_level.pbtxt models/max_priority_level/config.pbtxt
+
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+code=`curl -s -w %{http_code} -o ./$TRIAL.out localhost:8000/v2/models/max_priority_level/config`
+set -e
+if [ "$code" != "200" ]; then
+    cat $TRIAL.out
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+declare -A expected_values
+
+MAX_UINT64=18446744073709551615
+MAX_UINT32=4294967295
+MAX_UINT32_PLUS_1=4294967296
+
+expected_values["priority_levels"]=$MAX_UINT64
+expected_values["default_priority_level"]=$MAX_UINT32
+expected_values[$MAX_UINT32_PLUS_1]=\{\"timeout_action\":\"REJECT\",\"default_timeout_microseconds\":18446744073709551615,\"allow_timeout_override\":true,\"max_queue_size\":10\}
+expected_values["default_timeout_microseconds"]=$MAX_UINT64
+
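+# Each expected key/value pair must appear exactly once in the JSON config
+# returned by the server.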
+for key in "${!expected_values[@]}"; do
+    value=${expected_values[$key]}
+    matches=`grep -o "\"$key\":$value" $TRIAL.out | wc -l`
+    if [ $matches -ne 1 ]; then
+        cat $TRIAL.out
+        echo -e "\n***\n*** Expected 1 $key == $value, got $matches\n***"
+        RET=1
+    fi
+done
+
+kill $SERVER_PID
+wait $SERVER_PID
+
 if [ $RET -eq 0 ]; then
     echo -e "\n***\n*** Test Passed\n***"
 else
diff --git a/qa/L0_cuda_graph/test.sh b/qa/L0_cuda_graph/test.sh
old mode 100644
new mode 100755
index 58a796eb4d..9388dba77d
--- a/qa/L0_cuda_graph/test.sh
+++ b/qa/L0_cuda_graph/test.sh
@@ -49,7 +49,7 @@ rm -rf ${DATADIR}
 mkdir -p ${DATADIR}
 
 SERVER=/opt/tritonserver/bin/tritonserver
-SERVER_ARGS="--log-verbose=1 --model-repository=$DATADIR"
+SERVER_ARGS="--log-verbose=1 --model-repository=$DATADIR --strict-model-config=true"
 SERVER_LOG="./inference_server.log"
 source ../common/util.sh
 
@@ -118,6 +118,7 @@ wait $SERVER_PID
 rm -rf ${DATADIR} && mkdir -p ${DATADIR}
 cp -r /data/inferenceserver/${REPO_VERSION}/qa_variable_model_repository/plan_float32_float32_float32 ${DATADIR}/
 
+SERVER_ARGS="--log-verbose=1 --model-repository=$DATADIR --strict-model-config=true"
 CLIENT_LOG="./dynamic_shape.client.log"
 SERVER_LOG="./dynamic_shape.inference_server.log"
 sed -i "s/profile:.*/profile: [\"0\"]/" ${DATADIR}/plan_float32_float32_float32/config.pbtxt
@@ -167,6 +168,7 @@ cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/plan_float32_flo
 # Make sure only one version is present
 rm -rf ${DATADIR}/plan_float32_float32_float32/3
 
+SERVER_ARGS="--log-verbose=1 --model-repository=$DATADIR"
 CLIENT_LOG="./range_fixed_shape.client.log"
 SERVER_LOG="./range_fixed_shape.inference_server.log"
 echo "optimization { \
@@ -285,6 +287,53 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+# TrtCudaGraphTest.test_nobatch_fixed_shape
+rm -rf ${DATADIR} && mkdir -p ${DATADIR}
+cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/plan_nobatch_float32_float32_float32 ${DATADIR}/
+# Make sure only one version is present
+rm -rf ${DATADIR}/plan_nobatch_float32_float32_float32/2 ${DATADIR}/plan_nobatch_float32_float32_float32/3
+
+CLIENT_LOG="./nobatch_fixed_shape.client.log"
+SERVER_LOG="./nobatch_fixed_shape.inference_server.log"
+echo "optimization { cuda { graphs: true } }" >> ${DATADIR}/plan_nobatch_float32_float32_float32/config.pbtxt
+
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $TRT_CUDA_GRAPH_TEST TrtCudaGraphTest.test_nobatch_fixed_shape plan_nobatch>>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test Failed\n***"
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE 1
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+set -e
+
+set +e
+if [ `grep -c "Context with profile default \[0\] is launching CUDA graph " $SERVER_LOG` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected only one execution with CUDA graph\n***"
+    RET=1
+fi
+
+if [ `grep -c "captured CUDA graph for" $SERVER_LOG` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected 1 CUDA graph to be captured\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
 
 if [ $RET -eq 0 ]; then
   echo -e "\n***\n*** Test Passed\n***"
diff --git a/qa/L0_cuda_graph/trt_cuda_graph_test.py b/qa/L0_cuda_graph/trt_cuda_graph_test.py
old mode 100644
new mode 100755
index 851ae90ed2..a7f9f3be98
--- a/qa/L0_cuda_graph/trt_cuda_graph_test.py
+++ b/qa/L0_cuda_graph/trt_cuda_graph_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,39 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
 from tritonclientutils import *
 
 
 class TrtCudaGraphTest(tu.TestResultCollector):
+    MODELNAME = "plan"
 
     def setUp(self):
         self.dtype_ = np.float32
         self.dtype_str_ = "FP32"
-        self.model_name_ = 'plan'
+        self.model_name_ = self.MODELNAME
 
     def _check_infer(self, tensor_shape, batch_size=1):
         try:
-            iu.infer_exact(self,
-                           self.model_name_, (batch_size,) + tensor_shape,
-                           batch_size,
-                           self.dtype_,
-                           self.dtype_,
-                           self.dtype_,
-                           model_version=1,
-                           use_http_json_tensors=False,
-                           use_grpc=False,
-                           use_streaming=False)
+            if batch_size:
+                full_shape = (batch_size,) + tensor_shape
+            else:
+                full_shape = tensor_shape
+            iu.infer_exact(
+                self,
+                self.model_name_,
+                full_shape,
+                batch_size,
+                self.dtype_,
+                self.dtype_,
+                self.dtype_,
+                model_version=1,
+                use_http_json_tensors=False,
+                use_grpc=False,
+                use_streaming=False,
+            )
         except InferenceServerException as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def _erroneous_infer(self, tensor_shape, batch_size):
         import tritonhttpclient
+
         item_size = batch_size
         for dim in tensor_shape:
             item_size *= dim
@@ -68,30 +81,38 @@ def _erroneous_infer(self, tensor_shape, batch_size):
 
         inputs = []
         inputs.append(
-            tritonhttpclient.InferInput('INPUT0', full_shape, self.dtype_str_))
+            tritonhttpclient.InferInput("INPUT0", full_shape, self.dtype_str_)
+        )
         inputs[-1].set_data_from_numpy(input_np)
         inputs.append(
-            tritonhttpclient.InferInput('INPUT1', full_shape, self.dtype_str_))
+            tritonhttpclient.InferInput("INPUT1", full_shape, self.dtype_str_)
+        )
         inputs[-1].set_data_from_numpy(input_np)
         outputs = []
         outputs.append(
-            tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True))
+            tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True)
+        )
         outputs.append(
-            tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True))
+            tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True)
+        )
 
-        model_name = tu.get_model_name(self.model_name_, self.dtype_,
-                                       self.dtype_, self.dtype_)
+        model_name = tu.get_model_name(
+            self.model_name_, self.dtype_, self.dtype_, self.dtype_
+        )
         results = tritonhttpclient.InferenceServerClient(
-            "localhost:8000", verbose=True).infer(model_name=model_name,
-                                                  inputs=inputs,
-                                                  outputs=outputs)
+            "localhost:8000", verbose=True
+        ).infer(model_name=model_name, inputs=inputs, outputs=outputs)
         # Validate the results by comparing with precomputed values.
-        output0_np = results.as_numpy('OUTPUT0')
-        output1_np = results.as_numpy('OUTPUT1')
-        self.assertFalse(np.array_equal(output0_np, expected_output0_np),
-                         "expects OUTPUT0 is not correct")
-        self.assertFalse(np.array_equal(output1_np, expected_output1_np),
-                         "expects OUTPUT1 is not correct")
+        output0_np = results.as_numpy("OUTPUT0")
+        output1_np = results.as_numpy("OUTPUT1")
+        self.assertFalse(
+            np.array_equal(output0_np, expected_output0_np),
+            "expects OUTPUT0 is not correct",
+        )
+        self.assertFalse(
+            np.array_equal(output1_np, expected_output1_np),
+            "expects OUTPUT1 is not correct",
+        )
 
     def test_fixed_shape(self):
         tensor_shape = (16,)
@@ -131,6 +152,12 @@ def test_range_dynamic_shape(self):
         self._check_infer((16,), 8)
         self._check_infer((30,), 4)
 
+    def test_nobatch_fixed_shape(self):
+        self._check_infer((16,), 0)
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 2:
+        TrtCudaGraphTest.MODELNAME = sys.argv.pop()
 
-if __name__ == '__main__':
     unittest.main()
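
The test above now reads the target model base name from an optional trailing command-line argument (test.sh passes plan_nobatch for the new nobatch case) and treats batch_size 0 as a model with no batch dimension. Below is a minimal, self-contained sketch of that pattern; the class and file names are illustrative and not taken from the repository.

#!/usr/bin/env python3
# Minimal sketch of the pattern used above: a class-level default model name
# that an optional trailing command-line argument overrides before unittest
# runs, plus batch_size == 0 meaning "no batch dimension". Illustrative only.
import sys
import unittest


class ExampleTrtTest(unittest.TestCase):
    MODELNAME = "plan"  # default; overridden from the command line if given

    @staticmethod
    def full_shape(tensor_shape, batch_size):
        # batch_size == 0 means the model has no batch dimension.
        return (batch_size,) + tensor_shape if batch_size else tensor_shape

    def test_fixed_shape(self):
        self.assertEqual(self.full_shape((16,), 8), (8, 16))

    def test_nobatch_fixed_shape(self):
        self.assertEqual(self.full_shape((16,), 0), (16,))


if __name__ == "__main__":
    # e.g. python example_trt_test.py ExampleTrtTest.test_nobatch_fixed_shape plan_nobatch
    if len(sys.argv) > 2:
        ExampleTrtTest.MODELNAME = sys.argv.pop()
    unittest.main()
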
diff --git a/qa/L0_cuda_shared_memory/cuda_shared_memory_test.py b/qa/L0_cuda_shared_memory/cuda_shared_memory_test.py
old mode 100644
new mode 100755
index 4bff4eba75..87fb7c1d3c
--- a/qa/L0_cuda_shared_memory/cuda_shared_memory_test.py
+++ b/qa/L0_cuda_shared_memory/cuda_shared_memory_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,13 +27,14 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-import numpy as np
-import unittest
 import os
-import test_util as tu
+import unittest
 
+import numpy as np
+import test_util as tu
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 import tritonshmutils.cuda_shared_memory as cshm
@@ -39,16 +42,13 @@
 
 
 class CudaSharedMemoryTest(tu.TestResultCollector):
-
     def test_invalid_create_shm(self):
         # Raises error since tried to create invalid cuda shared memory region
         try:
-            shm_op0_handle = cshm.create_shared_memory_region(
-                "dummy_data", -1, 0)
+            shm_op0_handle = cshm.create_shared_memory_region("dummy_data", -1, 0)
             cshm.destroy_shared_memory_region(shm_op0_handle)
         except Exception as ex:
-            self.assertEqual(str(ex),
-                             "unable to create cuda shared memory handle")
+            self.assertEqual(str(ex), "unable to create cuda shared memory handle")
 
     def test_valid_create_set_register(self):
         # Create a valid cuda shared memory region, fill data in it and register
@@ -57,10 +57,12 @@ def test_valid_create_set_register(self):
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         shm_op0_handle = cshm.create_shared_memory_region("dummy_data", 8, 0)
-        cshm.set_shared_memory_region(shm_op0_handle,
-                                      [np.array([1, 2], dtype=np.float32)])
+        cshm.set_shared_memory_region(
+            shm_op0_handle, [np.array([1, 2], dtype=np.float32)]
+        )
         triton_client.register_cuda_shared_memory(
-            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8)
+            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8
+        )
         shm_status = triton_client.get_cuda_shared_memory_status()
         if _protocol == "http":
             self.assertEqual(len(shm_status), 1)
@@ -91,7 +93,8 @@ def test_unregister_after_register(self):
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         shm_op0_handle = cshm.create_shared_memory_region("dummy_data", 8, 0)
         triton_client.register_cuda_shared_memory(
-            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8)
+            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8
+        )
         triton_client.unregister_cuda_shared_memory("dummy_data")
         shm_status = triton_client.get_cuda_shared_memory_status()
         if _protocol == "http":
@@ -108,13 +111,16 @@ def test_reregister_after_register(self):
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         shm_op0_handle = cshm.create_shared_memory_region("dummy_data", 8, 0)
         triton_client.register_cuda_shared_memory(
-            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8)
+            "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8
+        )
         try:
             triton_client.register_cuda_shared_memory(
-                "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8)
+                "dummy_data", cshm.get_raw_handle(shm_op0_handle), 0, 8
+            )
         except Exception as ex:
             self.assertIn(
-                "shared memory region 'dummy_data' already in manager", str(ex))
+                "shared memory region 'dummy_data' already in manager", str(ex)
+            )
         shm_status = triton_client.get_cuda_shared_memory_status()
         if _protocol == "http":
             self.assertEqual(len(shm_status), 1)
@@ -137,27 +143,33 @@ def _configure_sever(self):
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         triton_client.register_cuda_shared_memory(
-            "input0_data", cshm.get_raw_handle(shm_ip0_handle), 0, 64)
+            "input0_data", cshm.get_raw_handle(shm_ip0_handle), 0, 64
+        )
         triton_client.register_cuda_shared_memory(
-            "input1_data", cshm.get_raw_handle(shm_ip1_handle), 0, 64)
+            "input1_data", cshm.get_raw_handle(shm_ip1_handle), 0, 64
+        )
         triton_client.register_cuda_shared_memory(
-            "output0_data", cshm.get_raw_handle(shm_op0_handle), 0, 64)
+            "output0_data", cshm.get_raw_handle(shm_op0_handle), 0, 64
+        )
         triton_client.register_cuda_shared_memory(
-            "output1_data", cshm.get_raw_handle(shm_op1_handle), 0, 64)
+            "output1_data", cshm.get_raw_handle(shm_op1_handle), 0, 64
+        )
         return [shm_ip0_handle, shm_ip1_handle, shm_op0_handle, shm_op1_handle]
 
     def _cleanup_server(self, shm_handles):
         for shm_handle in shm_handles:
             cshm.destroy_shared_memory_region(shm_handle)
 
-    def _basic_inference(self,
-                         shm_ip0_handle,
-                         shm_ip1_handle,
-                         shm_op0_handle,
-                         shm_op1_handle,
-                         error_msg,
-                         big_shm_name="",
-                         big_shm_size=64):
+    def _basic_inference(
+        self,
+        shm_ip0_handle,
+        shm_ip1_handle,
+        shm_op0_handle,
+        shm_op1_handle,
+        error_msg,
+        big_shm_name="",
+        big_shm_size=64,
+    ):
         input0_data = np.arange(start=0, stop=16, dtype=np.int32)
         input1_data = np.ones(shape=16, dtype=np.int32)
         inputs = []
@@ -166,16 +178,16 @@ def _basic_inference(self,
             triton_client = httpclient.InferenceServerClient(_url, verbose=True)
             inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32"))
             inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32"))
+            outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True))
             outputs.append(
-                httpclient.InferRequestedOutput('OUTPUT0', binary_data=True))
-            outputs.append(
-                httpclient.InferRequestedOutput('OUTPUT1', binary_data=False))
+                httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)
+            )
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
             inputs.append(grpcclient.InferInput("INPUT0", [1, 16], "INT32"))
             inputs.append(grpcclient.InferInput("INPUT1", [1, 16], "INT32"))
-            outputs.append(grpcclient.InferRequestedOutput('OUTPUT0'))
-            outputs.append(grpcclient.InferRequestedOutput('OUTPUT1'))
+            outputs.append(grpcclient.InferRequestedOutput("OUTPUT0"))
+            outputs.append(grpcclient.InferRequestedOutput("OUTPUT1"))
         inputs[0].set_shared_memory("input0_data", 64)
         if type(shm_ip1_handle) == np.array:
             inputs[1].set_data_from_numpy(input0_data, binary_data=True)
@@ -187,22 +199,21 @@ def _basic_inference(self,
         outputs[1].set_shared_memory("output1_data", 64)
 
         try:
-            results = triton_client.infer("simple",
-                                          inputs,
-                                          model_version="",
-                                          outputs=outputs)
-            output = results.get_output('OUTPUT0')
+            results = triton_client.infer(
+                "simple", inputs, model_version="", outputs=outputs
+            )
+            output = results.get_output("OUTPUT0")
             if _protocol == "http":
-                output_datatype = output['datatype']
-                output_shape = output['shape']
+                output_datatype = output["datatype"]
+                output_shape = output["shape"]
             else:
                 output_datatype = output.datatype
                 output_shape = output.shape
             output_dtype = triton_to_np_dtype(output_datatype)
-            output_data = cshm.get_contents_as_numpy(shm_op0_handle,
-                                                     output_dtype, output_shape)
-            self.assertTrue(
-                (output_data[0] == (input0_data + input1_data)).all())
+            output_data = cshm.get_contents_as_numpy(
+                shm_op0_handle, output_dtype, output_shape
+            )
+            self.assertTrue((output_data[0] == (input0_data + input1_data)).all())
         except Exception as ex:
             error_msg.append(str(ex))
 
@@ -210,8 +221,9 @@ def test_unregister_after_inference(self):
         # Unregister after inference
         error_msg = []
         shm_handles = self._configure_sever()
-        self._basic_inference(shm_handles[0], shm_handles[1], shm_handles[2],
-                              shm_handles[3], error_msg)
+        self._basic_inference(
+            shm_handles[0], shm_handles[1], shm_handles[2], shm_handles[3], error_msg
+        )
         if len(error_msg) > 0:
             raise Exception(str(error_msg))
         if _protocol == "http":
@@ -234,13 +246,15 @@ def test_register_after_inference(self):
             triton_client = httpclient.InferenceServerClient(_url, verbose=True)
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
-        self._basic_inference(shm_handles[0], shm_handles[1], shm_handles[2],
-                              shm_handles[3], error_msg)
+        self._basic_inference(
+            shm_handles[0], shm_handles[1], shm_handles[2], shm_handles[3], error_msg
+        )
         if len(error_msg) > 0:
             raise Exception(str(error_msg))
         shm_ip2_handle = cshm.create_shared_memory_region("input2_data", 64, 0)
         triton_client.register_cuda_shared_memory(
-            "input2_data", cshm.get_raw_handle(shm_ip2_handle), 0, 64)
+            "input2_data", cshm.get_raw_handle(shm_ip2_handle), 0, 64
+        )
         shm_status = triton_client.get_cuda_shared_memory_status()
         if _protocol == "http":
             self.assertEqual(len(shm_status), 5)
@@ -259,13 +273,22 @@ def test_too_big_shm(self):
         else:
             triton_client = grpcclient.InferenceServerClient(_url, verbose=True)
         triton_client.register_cuda_shared_memory(
-            "input2_data", cshm.get_raw_handle(shm_ip2_handle), 0, 128)
-        self._basic_inference(shm_handles[0], shm_ip2_handle, shm_handles[2],
-                              shm_handles[3], error_msg, "input2_data", 128)
+            "input2_data", cshm.get_raw_handle(shm_ip2_handle), 0, 128
+        )
+        self._basic_inference(
+            shm_handles[0],
+            shm_ip2_handle,
+            shm_handles[2],
+            shm_handles[3],
+            error_msg,
+            "input2_data",
+            128,
+        )
         if len(error_msg) > 0:
             self.assertIn(
                 "unexpected total byte size 128 for input 'INPUT1', expecting 64",
-                error_msg[-1])
+                error_msg[-1],
+            )
         shm_handles.append(shm_ip2_handle)
         self._cleanup_server(shm_handles)
 
@@ -274,8 +297,9 @@ def test_mixed_raw_shm(self):
         error_msg = []
         shm_handles = self._configure_sever()
         input1_data = np.ones(shape=16, dtype=np.int32)
-        self._basic_inference(shm_handles[0], [input1_data], shm_handles[2],
-                              shm_handles[3], error_msg)
+        self._basic_inference(
+            shm_handles[0], [input1_data], shm_handles[2], shm_handles[3], error_msg
+        )
         if len(error_msg) > 0:
             raise Exception(error_msg[-1])
         self._cleanup_server(shm_handles)
@@ -301,8 +325,8 @@ def test_unregisterall(self):
         self._cleanup_server(shm_handles)
 
 
-if __name__ == '__main__':
-    _protocol = os.environ.get('CLIENT_TYPE', "http")
+if __name__ == "__main__":
+    _protocol = os.environ.get("CLIENT_TYPE", "http")
     if _protocol == "http":
         _url = "localhost:8000"
     else:
diff --git a/qa/L0_cuda_shared_memory/test.sh b/qa/L0_cuda_shared_memory/test.sh
old mode 100644
new mode 100755
index 2e1120c9b1..b011244174
--- a/qa/L0_cuda_shared_memory/test.sh
+++ b/qa/L0_cuda_shared_memory/test.sh
@@ -50,7 +50,7 @@ for i in \
         test_unregisterall; do
     for client_type in http grpc; do
         SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1"
-        SERVER_LOG="./$i.$client_type.serverlog"
+        SERVER_LOG="./$i.$client_type.server.log"
         run_server
         if [ "$SERVER_PID" == "0" ]; then
             echo -e "\n***\n*** Failed to start $SERVER\n***"
diff --git a/qa/L0_custom_ops/cuda_op_test.py b/qa/L0_custom_ops/cuda_op_test.py
old mode 100644
new mode 100755
index d4389d67ad..896ed2adf0
--- a/qa/L0_custom_ops/cuda_op_test.py
+++ b/qa/L0_custom_ops/cuda_op_test.py
@@ -27,47 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
 from builtins import range
+
+import numpy as np
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 from tritonclientutils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
+        type=str,
+        required=False,
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -84,21 +87,22 @@
     input_data = np.arange(start=42, stop=42 + elements, dtype=np.int32)
 
     inputs = [
-        client_util.InferInput("in", input_data.shape,
-                               np_to_triton_dtype(input_data.dtype))
+        client_util.InferInput(
+            "in", input_data.shape, np_to_triton_dtype(input_data.dtype)
+        )
     ]
     inputs[0].set_data_from_numpy(input_data)
 
     results = client.infer(model_name, inputs)
-    output_data = results.as_numpy('out')
+    output_data = results.as_numpy("out")
     if output_data is None:
         print("error: expected 'out'")
         sys.exit(1)
 
     for i in range(elements):
         print(
-            str(i) + ": input " + str(input_data[i]) + ", output " +
-            str(output_data[i]))
+            str(i) + ": input " + str(input_data[i]) + ", output " + str(output_data[i])
+        )
         if output_data[i] != (input_data[i] + 1):
             print("error: incorrect value")
             sys.exit(1)
diff --git a/qa/L0_custom_ops/mod_op_test.py b/qa/L0_custom_ops/mod_op_test.py
old mode 100644
new mode 100755
index 62edd1e289..14855f7c40
--- a/qa/L0_custom_ops/mod_op_test.py
+++ b/qa/L0_custom_ops/mod_op_test.py
@@ -27,47 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
 from builtins import range
+
+import numpy as np
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 from tritonclientutils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
+        type=str,
+        required=False,
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -87,22 +90,32 @@
     inputs = []
     for i in range(len(input_data)):
         inputs.append(
-            client_util.InferInput("INPUT__{}".format(i), input_data[0].shape,
-                                   np_to_triton_dtype(input_data[0].dtype)))
+            client_util.InferInput(
+                "INPUT__{}".format(i),
+                input_data[0].shape,
+                np_to_triton_dtype(input_data[0].dtype),
+            )
+        )
         inputs[i].set_data_from_numpy(input_data[i])
 
     results = client.infer(model_name, inputs)
 
     # We expect 1 result of size 10 with alternating 1 and 0.
-    output_data = results.as_numpy('OUTPUT__0')
+    output_data = results.as_numpy("OUTPUT__0")
     if output_data is None:
         print("error: expected 'OUTPUT__0'")
         sys.exit(1)
 
     for i in range(elements):
         print(
-            str(i) + ": " + str(input_data[0][i]) + " % " +
-            str(input_data[1][i]) + " = " + str(output_data[i]))
-        if ((input_data[0][i] % input_data[1][i]) != output_data[i]):
+            str(i)
+            + ": "
+            + str(input_data[0][i])
+            + " % "
+            + str(input_data[1][i])
+            + " = "
+            + str(output_data[i])
+        )
+        if (input_data[0][i] % input_data[1][i]) != output_data[i]:
             print("error: incorrect value")
             sys.exit(1)
diff --git a/qa/L0_custom_ops/onnx_op_test.py b/qa/L0_custom_ops/onnx_op_test.py
old mode 100644
new mode 100755
index 6a3d5ebb53..9b246c8e31
--- a/qa/L0_custom_ops/onnx_op_test.py
+++ b/qa/L0_custom_ops/onnx_op_test.py
@@ -27,47 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
 from builtins import range
+
+import numpy as np
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 from tritonclientutils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
+        type=str,
+        required=False,
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -88,14 +91,16 @@
     inputs = []
     for i in range(len(input_data)):
         inputs.append(
-            client_util.InferInput("input_{}".format(i + 1), shape,
-                                   np_to_triton_dtype(dtype)))
+            client_util.InferInput(
+                "input_{}".format(i + 1), shape, np_to_triton_dtype(dtype)
+            )
+        )
         inputs[i].set_data_from_numpy(input_data[i])
 
     results = client.infer(model_name, inputs)
 
     # We expect 1 result of size 10 with alternating 1 and 0.
-    output_data = results.as_numpy('output')
+    output_data = results.as_numpy("output")
     if output_data is None:
         print("error: expected 'output'")
         sys.exit(1)
@@ -103,9 +108,12 @@
     for i in range(3):
         for j in range(5):
             print(
-                str(input_data[0][i][j]) + " + " + str(input_data[1][i][j]) +
-                " = " + str(output_data[i][j]))
-            if ((input_data[0][i][j] + input_data[1][i][j]) !=
-                    output_data[i][j]):
+                str(input_data[0][i][j])
+                + " + "
+                + str(input_data[1][i][j])
+                + " = "
+                + str(output_data[i][j])
+            )
+            if (input_data[0][i][j] + input_data[1][i][j]) != output_data[i][j]:
                 print("error: incorrect value")
                 sys.exit(1)
diff --git a/qa/L0_custom_ops/test.sh b/qa/L0_custom_ops/test.sh
index c4b50dd43d..a12c1d67a4 100755
--- a/qa/L0_custom_ops/test.sh
+++ b/qa/L0_custom_ops/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -57,9 +57,10 @@ RET=0
 
 # Must explicitly set LD_LIBRARY_PATH so that the custom operations
 # can find libtensorflow_framework.so.
-LD_LIBRARY_PATH=/opt/tritonserver/backends/tensorflow1:$LD_LIBRARY_PATH
+LD_LIBRARY_PATH=/opt/tritonserver/backends/tensorflow:$LD_LIBRARY_PATH
 
 # Tensorflow
+## Load operations via LD_PRELOAD
 SERVER_ARGS="--model-repository=/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops"
 SERVER_LD_PRELOAD="/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops/libzeroout.so:/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops/libcudaop.so:/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops/libbusyop.so"
 
@@ -105,13 +106,72 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+## Load operations via model config
+SERVER_ARGS="--model-repository=tf_custom_ops"
+SERVER_LD_PRELOAD=""
+
+rm -rf tf_custom_ops && \
+    mkdir -p tf_custom_ops && \
+    cp -r /data/inferenceserver/${REPO_VERSION}/qa_custom_ops/tf_custom_ops .
+
+for MODEL_TYPE in savedmodel graphdef; do
+    echo "model_operations { op_library_filename: \"tf_custom_ops/libbusyop.so\" }" >> tf_custom_ops/${MODEL_TYPE}_busyop/config.pbtxt
+    echo "model_operations { op_library_filename: \"tf_custom_ops/libcudaop.so\" }" >> tf_custom_ops/${MODEL_TYPE}_cudaop/config.pbtxt
+    echo "model_operations { op_library_filename: \"tf_custom_ops/libzeroout.so\" }" >> tf_custom_ops/${MODEL_TYPE}_zeroout/config.pbtxt
+done
+
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+python $ZERO_OUT_TEST -v -m graphdef_zeroout >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+python $ZERO_OUT_TEST -v -m savedmodel_zeroout >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+python $CUDA_OP_TEST -v -m graphdef_cudaop >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+python $CUDA_OP_TEST -v -m savedmodel_cudaop >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
 # Must set LD_LIBRARY_PATH just for the server launch so that the
 # custom operations can find libtorch.so and other pytorch dependencies.
 LD_LIBRARY_PATH=/opt/tritonserver/backends/pytorch:$LD_LIBRARY_PATH
 
 # Pytorch
 SERVER_ARGS="--model-repository=/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/libtorch_custom_ops"
-SERVER_LD_PRELOAD="/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/libtorch_custom_ops/libtorch_modulo/custom_modulo.so"
+# FIXME: Pre-load the python system library to satisfy the symbol definitions,
+# as the custom op library is built with a different python version within the
+# pytorch container. See DLIS-4152.
+SERVER_LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libpython3.10.so.1:/data/inferenceserver/${REPO_VERSION}/qa_custom_ops/libtorch_custom_ops/libtorch_modulo/custom_modulo.so"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
diff --git a/qa/L0_custom_ops/vision_op_test.py b/qa/L0_custom_ops/vision_op_test.py
old mode 100644
new mode 100755
index c925dc19c0..88857c3d12
--- a/qa/L0_custom_ops/vision_op_test.py
+++ b/qa/L0_custom_ops/vision_op_test.py
@@ -27,46 +27,49 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
+
+import numpy as np
 import tritonclient.grpc as grpcclient
 import tritonclient.http as httpclient
 from tritonclient.utils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
+        type=str,
+        required=False,
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -83,23 +86,26 @@
 
     inputs = []
     inputs.append(
-        client_util.InferInput("INPUT__0", input_data.shape,
-                               np_to_triton_dtype(input_data.dtype)))
+        client_util.InferInput(
+            "INPUT__0", input_data.shape, np_to_triton_dtype(input_data.dtype)
+        )
+    )
     inputs[0].set_data_from_numpy(input_data)
     inputs.append(
-        client_util.InferInput("INPUT__1", box_data.shape,
-                               np_to_triton_dtype(box_data.dtype)))
+        client_util.InferInput(
+            "INPUT__1", box_data.shape, np_to_triton_dtype(box_data.dtype)
+        )
+    )
     inputs[1].set_data_from_numpy(box_data)
 
     results = client.infer(model_name, inputs)
 
     # We expect 1 result of shape [1, 3, 5, 5].
-    output_data = results.as_numpy('OUTPUT__0')
+    output_data = results.as_numpy("OUTPUT__0")
     if output_data is None:
         print("error: expected 'OUTPUT__0'")
         sys.exit(1)
 
-    if (output_data.shape != (1, 3, 5, 5)):
-        print("error: incorrect shape " + str(output_data.shape) +
-              "for 'OUTPUT__0'")
+    if output_data.shape != (1, 3, 5, 5):
+        print("error: incorrect shape " + str(output_data.shape) + "for 'OUTPUT__0'")
         sys.exit(1)
diff --git a/qa/L0_custom_ops/zero_out_test.py b/qa/L0_custom_ops/zero_out_test.py
old mode 100644
new mode 100755
index ad87dc8f37..28d5d2c9e6
--- a/qa/L0_custom_ops/zero_out_test.py
+++ b/qa/L0_custom_ops/zero_out_test.py
@@ -27,47 +27,50 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import numpy as np
 import sys
 from builtins import range
+
+import numpy as np
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
 from tritonclientutils import np_to_triton_dtype
 
 FLAGS = None
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-u',
-                        '--url',
-                        type=str,
-                        required=False,
-                        default='localhost:8000',
-                        help='Inference server URL. Default is localhost:8000.')
     parser.add_argument(
-        '-i',
-        '--protocol',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-u",
+        "--url",
         type=str,
         required=False,
-        default='http',
-        help='Protocol ("http"/"grpc") used to ' +
-        'communicate with inference service. Default is "http".')
-    parser.add_argument('-m',
-                        '--model',
-                        type=str,
-                        required=True,
-                        help='Name of model.')
+        default="localhost:8000",
+        help="Inference server URL. Default is localhost:8000.",
+    )
+    parser.add_argument(
+        "-i",
+        "--protocol",
+        type=str,
+        required=False,
+        default="http",
+        help='Protocol ("http"/"grpc") used to '
+        + 'communicate with inference service. Default is "http".',
+    )
+    parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.")
 
     FLAGS = parser.parse_args()
     if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"):
-        print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format(
-            FLAGS.protocol))
+        print(
+            'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol)
+        )
         exit(1)
 
     client_util = httpclient if FLAGS.protocol == "http" else grpcclient
@@ -83,8 +86,9 @@
     input_data = np.arange(start=42, stop=42 + elements, dtype=np.int32)
 
     inputs = [
-        client_util.InferInput("to_zero", input_data.shape,
-                               np_to_triton_dtype(input_data.dtype))
+        client_util.InferInput(
+            "to_zero", input_data.shape, np_to_triton_dtype(input_data.dtype)
+        )
     ]
     inputs[0].set_data_from_numpy(input_data)
     results = client.infer(model_name, inputs)
@@ -97,8 +101,8 @@
 
     for i in range(elements):
         print(
-            str(i) + ": input " + str(input_data[i]) + ", output " +
-            str(output_data[i]))
+            str(i) + ": input " + str(input_data[i]) + ", output " + str(output_data[i])
+        )
         if (i == 0) and (input_data[i] != output_data[i]):
             print("error: incorrect value")
             sys.exit(1)
diff --git a/qa/L0_data_compression/test.sh b/qa/L0_data_compression/test.sh
old mode 100644
new mode 100755
index aa8b950fe5..28255f5f7b
--- a/qa/L0_data_compression/test.sh
+++ b/qa/L0_data_compression/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -55,7 +55,7 @@ set +e
 echo "All work and no play makes Jack a dull boy" >> raw_data
 python3 validation.py generate_compressed_data
 
-$DATA_COMPRESSOR_TEST >>$TEST_LOG 2>&1
+LD_LIBRARY_PATH=/opt/tritonserver/lib:${LD_LIBRARY_PATH} $DATA_COMPRESSOR_TEST >>$TEST_LOG 2>&1
 if [ $? -ne 0 ]; then
     echo -e "\n***\n*** Data Compression Test Failed\n***"
     RET=1
@@ -148,6 +148,9 @@ if [ $? -ne 0 ]; then
 fi
 set -e
 
+kill $SERVER_PID
+wait $SERVER_PID
+
 if [ $RET -eq 0 ]; then
     echo -e "\n***\n*** Test Passed\n***"
 else
diff --git a/qa/L0_data_compression/validation.py b/qa/L0_data_compression/validation.py
old mode 100644
new mode 100755
index 927c863952..a0e5cb1576
--- a/qa/L0_data_compression/validation.py
+++ b/qa/L0_data_compression/validation.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -29,8 +31,9 @@
 
 def generate_compressed_data():
     with open("raw_data", "rb") as f:
-        import zlib
         import gzip
+        import zlib
+
         raw_data = f.read()
         with open("deflate_compressed_data", "wb") as of:
             of.write(zlib.compress(raw_data))
@@ -40,8 +43,9 @@ def generate_compressed_data():
 
 def validate_compressed_data():
     with open("raw_data", "rb") as f:
-        import zlib
         import gzip
+        import zlib
+
         raw_data = f.read()
         with open("generated_deflate_compressed_data", "rb") as cf:
             decompressed_data = zlib.decompress(cf.read())
@@ -53,5 +57,5 @@ def validate_compressed_data():
                 exit(1)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     globals()[sys.argv[1]]()
diff --git a/qa/L0_decoupled/decoupled_test.py b/qa/L0_decoupled/decoupled_test.py
old mode 100644
new mode 100755
index bb2219b6f0..b78170cf63
--- a/qa/L0_decoupled/decoupled_test.py
+++ b/qa/L0_decoupled/decoupled_test.py
@@ -1,5 +1,6 @@
-#!/bin/bash
-# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -26,59 +27,94 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from functools import partial
-import numpy as np
-import queue
-import unittest
 import os
+import queue
 import time
-import test_util as tu
+import unittest
+from functools import partial
 
-import tritongrpcclient as grpcclient
-import tritonhttpclient as httpclient
-from tritonclientutils import InferenceServerException
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+from tritonclient.utils import InferenceServerException
 
 
 class UserData:
-
     def __init__(self):
-        self._completed_requests = queue.Queue()
+        self._response_queue = queue.Queue()
 
 
 def callback(user_data, result, error):
     if error:
-        user_data._completed_requests.put(error)
+        user_data._response_queue.put(error)
     else:
-        user_data._completed_requests.put(result)
+        user_data._response_queue.put(result)
 
 
 class DecoupledTest(tu.TestResultCollector):
-
     def setUp(self):
-        self.trials_ = [("repeat_int32", None), ("simple_repeat", None),
-                        ("sequence_repeat", None),
-                        ("fan_repeat", self._fan_validate),
-                        ("repeat_square", self._nested_validate),
-                        ("nested_square", self._nested_validate)]
+        self.trials_ = [
+            ("repeat_int32", None),
+            ("simple_repeat", None),
+            ("sequence_repeat", None),
+            ("fan_repeat", self._fan_validate),
+            ("repeat_square", self._nested_validate),
+            ("nested_square", self._nested_validate),
+        ]
         self.model_name_ = "repeat_int32"
 
         self.inputs_ = []
-        self.inputs_.append(grpcclient.InferInput('IN', [1], "INT32"))
-        self.inputs_.append(grpcclient.InferInput('DELAY', [1], "UINT32"))
-        self.inputs_.append(grpcclient.InferInput('WAIT', [1], "UINT32"))
+        self.inputs_.append(grpcclient.InferInput("IN", [1], "INT32"))
+        self.inputs_.append(grpcclient.InferInput("DELAY", [1], "UINT32"))
+        self.inputs_.append(grpcclient.InferInput("WAIT", [1], "UINT32"))
 
         self.outputs_ = []
-        self.outputs_.append(grpcclient.InferRequestedOutput('OUT'))
-        self.outputs_.append(grpcclient.InferRequestedOutput('IDX'))
+        self.outputs_.append(grpcclient.InferRequestedOutput("OUT"))
+        self.outputs_.append(grpcclient.InferRequestedOutput("IDX"))
         # Some trials only expect a subset of outputs
         self.requested_outputs_ = self.outputs_
 
-    def _stream_infer(self, request_count, request_delay, expected_count,
-                      delay_data, delay_factor, user_data, result_dict):
-        with grpcclient.InferenceServerClient(url="localhost:8001",
-                                              verbose=True) as triton_client:
+    # The client can receive a "triton_final_response" response parameter
+    # from the Triton server that indicates when a response is the final
+    # response for its request.
+    #
+    # For non-decoupled models, there is a 1:1 request:response ratio, so every
+    # response is the final response, and this parameter is unnecessary.
+    #
+    # For decoupled models, there is a 1:N request:response ratio, so there may be
+    # more than one response before receiving the "final" response.
+    #
+    # However, decoupled models have the unique property that they can return
+    # a flags-only response to the server to indicate completion, which is not
+    # returned to the client by default (See TRITONBACKEND_ResponseFactorySendFlags).
+    #
+    # To forward this flags-only response to the client, users must opt in to this
+    # behavior by adding the following argument:
+    # client.async_stream_infer(..., enable_empty_final_response=True).
+    #
+    # If the decoupled backend/model always sends the final response flag along
+    # with a non-null response, no opt-in is needed.
+    #
+    # With this behavior, the client can programmatically detect when all responses
+    # for an individual request have been received without knowing the expected
+    # number of responses in advance and without closing the stream.
+    def _stream_infer_with_params(
+        self,
+        request_count,
+        request_delay,
+        _,
+        delay_data,
+        delay_factor,
+        user_data,
+        result_dict,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
             # Establish stream
             triton_client.start_stream(callback=partial(callback, user_data))
             # Send specified many requests in parallel
@@ -89,7 +125,67 @@ def _stream_infer(self, request_count, request_delay, expected_count,
                     model_name=self.model_name_,
                     inputs=self.inputs_,
                     request_id=str(i),
-                    outputs=self.requested_outputs_)
+                    outputs=self.requested_outputs_,
+                    # Opt-in to receiving flags-only responses from model/backend
+                    # to help detect final responses for decoupled models.
+                    enable_empty_final_response=True,
+                )
+                # Update delay input in accordance with the scaling factor
+                delay_data = delay_data * delay_factor
+                delay_data = delay_data.astype(np.uint32)
+
+            # Retrieve results...
+            recv_count = 0
+            completed_requests = 0
+            while completed_requests < request_count:
+                data_item = user_data._response_queue.get()
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    response = data_item.get_response()
+                    # Request IDs should generally be provided with each request
+                    # to associate decoupled responses with their requests.
+                    if not response.id:
+                        raise ValueError(
+                            "No response id found. Was a request_id provided?"
+                        )
+
+                    # Detect final response. Parameters are oneof and we expect bool_param
+                    if response.parameters.get("triton_final_response").bool_param:
+                        completed_requests += 1
+
+                    # Only process non-empty response, ignore if empty (no outputs)
+                    if response.outputs:
+                        if response.id not in result_dict:
+                            result_dict[response.id] = []
+                        result_dict[response.id].append((recv_count, data_item))
+                        recv_count += 1
+
+    def _stream_infer(
+        self,
+        request_count,
+        request_delay,
+        expected_count,
+        delay_data,
+        delay_factor,
+        user_data,
+        result_dict,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
+            # Establish stream
+            triton_client.start_stream(callback=partial(callback, user_data))
+            # Send specified many requests in parallel
+            for i in range(request_count):
+                time.sleep((request_delay / 1000))
+                self.inputs_[1].set_data_from_numpy(delay_data)
+                triton_client.async_stream_infer(
+                    model_name=self.model_name_,
+                    inputs=self.inputs_,
+                    request_id=str(i),
+                    outputs=self.requested_outputs_,
+                )
                 # Update delay input in accordance with the scaling factor
                 delay_data = delay_data * delay_factor
                 delay_data = delay_data.astype(np.uint32)
@@ -97,12 +193,12 @@ def _stream_infer(self, request_count, request_delay, expected_count,
             # Retrieve results...
             recv_count = 0
             while recv_count < expected_count:
-                data_item = user_data._completed_requests.get()
+                data_item = user_data._response_queue.get()
                 if type(data_item) == InferenceServerException:
                     raise data_item
                 else:
                     this_id = data_item.get_response().id
-                    if this_id not in result_dict.keys():
+                    if this_id not in result_dict:
                         result_dict[this_id] = []
                     result_dict[this_id].append((recv_count, data_item))
 
@@ -113,7 +209,7 @@ def _fan_validate(self, result_list, data_offset, repeat_count):
         self.assertEqual(len(result_list), repeat_count)
         expected_data = 2 * data_offset
         for j in range(len(result_list)):
-            this_data = result_list[j][1].as_numpy('OUT')
+            this_data = result_list[j][1].as_numpy("OUT")
             self.assertEqual(len(this_data), 1)
             self.assertEqual(this_data[0], expected_data)
             expected_data += 2
@@ -121,13 +217,12 @@ def _fan_validate(self, result_list, data_offset, repeat_count):
     def _nested_validate(self, result_list, data_offset, repeat_count):
         # if repeat model returns repeat result n, repeat_square-like model
         # will return the same result n times
-        expected_len = sum(
-            x for x in range(data_offset, data_offset + repeat_count))
+        expected_len = sum(x for x in range(data_offset, data_offset + repeat_count))
         self.assertEqual(len(result_list), expected_len)
         expected_data = data_offset
         expected_count = expected_data
         for j in range(len(result_list)):
-            this_data = result_list[j][1].as_numpy('OUT')
+            this_data = result_list[j][1].as_numpy("OUT")
             self.assertEqual(len(this_data), 1)
             self.assertEqual(this_data[0], expected_data)
             expected_count -= 1
@@ -135,20 +230,22 @@ def _nested_validate(self, result_list, data_offset, repeat_count):
                 expected_data += 1
                 expected_count = expected_data
 
-    def _decoupled_infer(self,
-                         request_count,
-                         request_delay=0,
-                         repeat_count=1,
-                         data_offset=100,
-                         delay_time=1000,
-                         delay_factor=1,
-                         wait_time=500,
-                         order_sequence=None,
-                         validate_fn=None):
+    def _decoupled_infer(
+        self,
+        request_count,
+        request_delay=0,
+        repeat_count=1,
+        data_offset=100,
+        delay_time=1000,
+        delay_factor=1,
+        wait_time=500,
+        order_sequence=None,
+        validate_fn=None,
+    ):
         # Initialize data for IN
-        input_data = np.arange(start=data_offset,
-                               stop=data_offset + repeat_count,
-                               dtype=np.int32)
+        input_data = np.arange(
+            start=data_offset, stop=data_offset + repeat_count, dtype=np.int32
+        )
         self.inputs_[0].set_shape([repeat_count])
         self.inputs_[0].set_data_from_numpy(input_data)
 
@@ -161,54 +258,67 @@ def _decoupled_infer(self,
         self.inputs_[2].set_data_from_numpy(wait_data)
 
         # use validate_fn to differentiate requested outputs
-        self.requested_outputs_ = self.outputs_ if validate_fn is None else self.outputs_[
-            0:1]
+        self.requested_outputs_ = (
+            self.outputs_ if validate_fn is None else self.outputs_[0:1]
+        )
 
-        user_data = UserData()
-        result_dict = {}
+        for infer_helper in [self._stream_infer, self._stream_infer_with_params]:
+            user_data = UserData()
+            result_dict = {}
 
-        try:
-            if "square" not in self.model_name_:
-                expected_count = (repeat_count * request_count)
-            else:
-                expected_count = sum(
-                    x for x in range(data_offset, data_offset +
-                                     repeat_count)) * request_count
-            self._stream_infer(request_count, request_delay, expected_count,
-                               delay_data, delay_factor, user_data, result_dict)
-        except Exception as ex:
-            self.assertTrue(False, "unexpected error {}".format(ex))
-
-        # Validate the results..
-        for i in range(request_count):
-            this_id = str(i)
-            if repeat_count != 0 and this_id not in result_dict.keys():
-                self.assertTrue(
-                    False,
-                    "response for request id {} not received".format(this_id))
-            elif repeat_count == 0 and this_id in result_dict.keys():
-                self.assertTrue(
-                    False,
-                    "received unexpected response for request id {}".format(
-                        this_id))
-            if repeat_count != 0:
-                if validate_fn is None:
-                    self.assertEqual(len(result_dict[this_id]), repeat_count)
-                    expected_data = data_offset
-                    result_list = result_dict[this_id]
-                    for j in range(len(result_list)):
-                        if order_sequence is not None:
-                            self.assertEqual(result_list[j][0],
-                                             order_sequence[i][j])
-                        this_data = result_list[j][1].as_numpy('OUT')
-                        self.assertEqual(len(this_data), 1)
-                        self.assertEqual(this_data[0], expected_data)
-                        this_idx = result_list[j][1].as_numpy('IDX')
-                        self.assertEqual(len(this_idx), 1)
-                        self.assertEqual(this_idx[0], j)
-                        expected_data += 1
+            try:
+                if "square" not in self.model_name_:
+                    expected_count = repeat_count * request_count
                 else:
-                    validate_fn(result_dict[this_id], data_offset, repeat_count)
+                    expected_count = (
+                        sum(x for x in range(data_offset, data_offset + repeat_count))
+                        * request_count
+                    )
+                infer_helper(
+                    request_count,
+                    request_delay,
+                    expected_count,
+                    delay_data,
+                    delay_factor,
+                    user_data,
+                    result_dict,
+                )
+            except Exception as ex:
+                self.assertTrue(False, "unexpected error {}".format(ex))
+
+            # Validate the results.
+            for i in range(request_count):
+                this_id = str(i)
+                if repeat_count != 0 and this_id not in result_dict.keys():
+                    self.assertTrue(
+                        False, "response for request id {} not received".format(this_id)
+                    )
+                elif repeat_count == 0 and this_id in result_dict.keys():
+                    self.assertTrue(
+                        False,
+                        "received unexpected response for request id {}".format(
+                            this_id
+                        ),
+                    )
+                if repeat_count != 0:
+                    if validate_fn is None:
+                        self.assertEqual(len(result_dict[this_id]), repeat_count)
+                        expected_data = data_offset
+                        result_list = result_dict[this_id]
+                        for j in range(len(result_list)):
+                            if order_sequence is not None:
+                                self.assertEqual(
+                                    result_list[j][0], order_sequence[i][j]
+                                )
+                            this_data = result_list[j][1].as_numpy("OUT")
+                            self.assertEqual(len(this_data), 1)
+                            self.assertEqual(this_data[0], expected_data)
+                            this_idx = result_list[j][1].as_numpy("IDX")
+                            self.assertEqual(len(this_idx), 1)
+                            self.assertEqual(this_idx[0], j)
+                            expected_data += 1
+                    else:
+                        validate_fn(result_dict[this_id], data_offset, repeat_count)
 
     def test_one_to_none(self):
         # Test cases where each request generates no response.
@@ -218,13 +328,9 @@ def test_one_to_none(self):
         for trial in self.trials_:
             self.model_name_ = trial[0]
             # Single request case
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=0,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=1, repeat_count=0, validate_fn=trial[1])
             # Multiple request case
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=0,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=5, repeat_count=0, validate_fn=trial[1])
 
     def test_one_to_one(self):
         # Test cases where each request generates a single response.
@@ -235,23 +341,15 @@ def test_one_to_one(self):
             self.model_name_ = trial[0]
             # Single request case
             # Release request before the response is delivered
-            self._decoupled_infer(request_count=1,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=1, wait_time=500, validate_fn=trial[1])
             # Release request after the response is delivered
-            self._decoupled_infer(request_count=1,
-                                  wait_time=2000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=1, wait_time=2000, validate_fn=trial[1])
 
             # Multiple request case
             # Release request before the response is delivered
-            self._decoupled_infer(request_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=5, wait_time=500, validate_fn=trial[1])
             # Release request after the response is delivered
-            self._decoupled_infer(request_count=5,
-                                  wait_time=2000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(request_count=5, wait_time=2000, validate_fn=trial[1])
 
     def test_one_to_many(self):
         # Test cases where each request generates multiple responses.
@@ -264,37 +362,31 @@ def test_one_to_many(self):
             self.model_name_ = trial[0]
             # Single request case
             # Release request before the first response is delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=500, validate_fn=trial[1]
+            )
             # Release request when the responses are getting delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=2000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=2000, validate_fn=trial[1]
+            )
             # Release request after all the responses are delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=10000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=10000, validate_fn=trial[1]
+            )
 
             # Multiple request case
             # Release request before the first response is delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=500, validate_fn=trial[1]
+            )
             # Release request when the responses are getting delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=2000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=2000, validate_fn=trial[1]
+            )
             # Release request after all the responses are delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=10000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=10000, validate_fn=trial[1]
+            )
 
     def test_one_to_multi_many(self):
         # Test cases where each request generates multiple responses but the
@@ -307,37 +399,31 @@ def test_one_to_multi_many(self):
             self.model_name_ = trial[0]
             # Single request case
             # Release request before the first response is delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=500, validate_fn=trial[1]
+            )
             # Release request when the responses are getting delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=8000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=8000, validate_fn=trial[1]
+            )
             # Release request after all the responses are delivered
-            self._decoupled_infer(request_count=1,
-                                  repeat_count=5,
-                                  wait_time=20000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=1, repeat_count=5, wait_time=20000, validate_fn=trial[1]
+            )
 
             # Multiple request case
             # Release request before the first response is delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=500,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=500, validate_fn=trial[1]
+            )
             # Release request when the responses are getting delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=3000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=3000, validate_fn=trial[1]
+            )
             # Release request after all the responses are delivered
-            self._decoupled_infer(request_count=5,
-                                  repeat_count=5,
-                                  wait_time=10000,
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=5, repeat_count=5, wait_time=10000, validate_fn=trial[1]
+            )
 
     def test_response_order(self):
         # Test the expected response order for different cases
@@ -348,51 +434,61 @@ def test_response_order(self):
             self.model_name_ = trial[0]
 
             # Case 1: Interleaved responses
-            self._decoupled_infer(request_count=2,
-                                  request_delay=500,
-                                  repeat_count=4,
-                                  order_sequence=[[0, 2, 4, 6], [1, 3, 5, 7]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=500,
+                repeat_count=4,
+                order_sequence=[[0, 2, 4, 6], [1, 3, 5, 7]],
+                validate_fn=trial[1],
+            )
 
             # Case 2: All responses of the second request are delivered before any
             # response from the first
-            self._decoupled_infer(request_count=2,
-                                  request_delay=500,
-                                  repeat_count=4,
-                                  delay_time=2000,
-                                  delay_factor=0.1,
-                                  order_sequence=[[4, 5, 6, 7], [0, 1, 2, 3]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=500,
+                repeat_count=4,
+                delay_time=2000,
+                delay_factor=0.1,
+                order_sequence=[[4, 5, 6, 7], [0, 1, 2, 3]],
+                validate_fn=trial[1],
+            )
 
             # Case 3: Similar to Case 2, but the second request is generated
             # after the first response from the first request is received
-            self._decoupled_infer(request_count=2,
-                                  request_delay=2500,
-                                  repeat_count=4,
-                                  delay_time=2000,
-                                  delay_factor=0.1,
-                                  order_sequence=[[0, 5, 6, 7], [1, 2, 3, 4]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=2500,
+                repeat_count=4,
+                delay_time=2000,
+                delay_factor=0.1,
+                order_sequence=[[0, 5, 6, 7], [1, 2, 3, 4]],
+                validate_fn=trial[1],
+            )
 
             # Case 4: All the responses of the second request are delivered after
             # all the responses from the first request are received
-            self._decoupled_infer(request_count=2,
-                                  request_delay=100,
-                                  repeat_count=4,
-                                  delay_time=500,
-                                  delay_factor=10,
-                                  order_sequence=[[0, 1, 2, 3], [4, 5, 6, 7]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=100,
+                repeat_count=4,
+                delay_time=500,
+                delay_factor=10,
+                order_sequence=[[0, 1, 2, 3], [4, 5, 6, 7]],
+                validate_fn=trial[1],
+            )
 
             # Case 5: Similar to Case 4, but the second request is generated
             # after the first response from the first request is received
-            self._decoupled_infer(request_count=2,
-                                  request_delay=750,
-                                  repeat_count=4,
-                                  delay_time=500,
-                                  delay_factor=10,
-                                  order_sequence=[[0, 1, 2, 3], [4, 5, 6, 7]],
-                                  validate_fn=trial[1])
+            self._decoupled_infer(
+                request_count=2,
+                request_delay=750,
+                repeat_count=4,
+                delay_time=500,
+                delay_factor=10,
+                order_sequence=[[0, 1, 2, 3], [4, 5, 6, 7]],
+                validate_fn=trial[1],
+            )
 
     def _no_streaming_helper(self, protocol):
         data_offset = 100
@@ -400,9 +496,9 @@ def _no_streaming_helper(self, protocol):
         delay_time = 1000
         wait_time = 2000
 
-        input_data = np.arange(start=data_offset,
-                               stop=data_offset + repeat_count,
-                               dtype=np.int32)
+        input_data = np.arange(
+            start=data_offset, stop=data_offset + repeat_count, dtype=np.int32
+        )
         delay_data = (np.ones([repeat_count], dtype=np.uint32)) * delay_time
         wait_data = np.array([wait_time], dtype=np.uint32)
 
@@ -412,12 +508,11 @@ def _no_streaming_helper(self, protocol):
             this_outputs = self.outputs_
         else:
             this_inputs = []
-            this_inputs.append(
-                httpclient.InferInput('IN', [repeat_count], "INT32"))
-            this_inputs.append(httpclient.InferInput('DELAY', [1], "UINT32"))
-            this_inputs.append(httpclient.InferInput('WAIT', [1], "UINT32"))
+            this_inputs.append(httpclient.InferInput("IN", [repeat_count], "INT32"))
+            this_inputs.append(httpclient.InferInput("DELAY", [1], "UINT32"))
+            this_inputs.append(httpclient.InferInput("WAIT", [1], "UINT32"))
             this_outputs = []
-            this_outputs.append(httpclient.InferRequestedOutput('OUT'))
+            this_outputs.append(httpclient.InferRequestedOutput("OUT"))
 
         # Initialize data for IN
         this_inputs[0].set_shape([repeat_count])
@@ -432,19 +527,22 @@ def _no_streaming_helper(self, protocol):
 
         if protocol == "grpc":
             triton_client = grpcclient.InferenceServerClient(
-                url="localhost:8001", verbose=True)
+                url="localhost:8001", verbose=True
+            )
         else:
             triton_client = httpclient.InferenceServerClient(
-                url="localhost:8000", verbose=True)
+                url="localhost:8000", verbose=True
+            )
 
         with self.assertRaises(InferenceServerException) as cm:
-            triton_client.infer(model_name=self.model_name_,
-                                inputs=this_inputs,
-                                outputs=this_outputs)
+            triton_client.infer(
+                model_name=self.model_name_, inputs=this_inputs, outputs=this_outputs
+            )
 
         self.assertIn(
             "doesn't support models with decoupled transaction policy",
-            str(cm.exception))
+            str(cm.exception),
+        )
 
     def test_no_streaming(self):
         # Test cases with no streaming inference. Server should give
@@ -463,9 +561,9 @@ def test_wrong_shape(self):
         delay_time = 1000
         wait_time = 2000
 
-        input_data = np.arange(start=data_offset,
-                               stop=data_offset + repeat_count,
-                               dtype=np.int32)
+        input_data = np.arange(
+            start=data_offset, stop=data_offset + repeat_count, dtype=np.int32
+        )
         delay_data = (np.ones([repeat_count + 1], dtype=np.uint32)) * delay_time
         wait_data = np.array([wait_time], dtype=np.uint32)
 
@@ -484,12 +582,14 @@ def test_wrong_shape(self):
         result_dict = {}
 
         with self.assertRaises(InferenceServerException) as cm:
-            self._stream_infer(1, 0, repeat_count, delay_data, 1, user_data,
-                               result_dict)
+            self._stream_infer(
+                1, 0, repeat_count, delay_data, 1, user_data, result_dict
+            )
 
-        self.assertIn("expected IN and DELAY shape to match, got [1] and [2]",
-                      str(cm.exception))
+        self.assertIn(
+            "expected IN and DELAY shape to match, got [1] and [2]", str(cm.exception)
+        )
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
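
The streaming helpers above follow a simple callback-and-queue idiom: a UserData holder wraps a queue, the gRPC stream callback pushes every result (or error) into it, and the test drains the queue until the expected number of responses has arrived. The sketch below is a minimal, stand-alone illustration of that idiom; the UserData and callback definitions, the model name, and the guarded main are assumptions modeled on how they are used in decoupled_test.py rather than the file's actual definitions.

# Minimal sketch of the callback/queue idiom used by the streaming helpers above.
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException


class UserData:
    def __init__(self):
        # Responses (or errors) are pushed here by the stream callback.
        self._response_queue = queue.Queue()


def callback(user_data, result, error):
    # Invoked by the gRPC stream for every response; keep errors as-is so the
    # consumer can re-raise them.
    user_data._response_queue.put(error if error is not None else result)


def stream_once(model_name, input_array, expected_count):
    user_data = UserData()
    results = []
    with grpcclient.InferenceServerClient(url="localhost:8001") as client:
        client.start_stream(callback=partial(callback, user_data))
        infer_input = grpcclient.InferInput("IN", list(input_array.shape), "INT32")
        infer_input.set_data_from_numpy(input_array)
        client.async_stream_infer(
            model_name=model_name, inputs=[infer_input], request_id="0"
        )
        # Drain the queue until the expected number of responses has arrived.
        while len(results) < expected_count:
            item = user_data._response_queue.get()
            if isinstance(item, InferenceServerException):
                raise item
            results.append(item)
    return results


if __name__ == "__main__":
    # "repeat_int32" is a placeholder for a decoupled model served at localhost:8001.
    print(len(stream_once("repeat_int32", np.arange(100, 102, dtype=np.int32), 2)))

Re-raising errors on the consumer side keeps a broken stream from stalling the test silently; it surfaces as an exception instead.
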
diff --git a/qa/L0_decoupled/test.sh b/qa/L0_decoupled/test.sh
old mode 100644
new mode 100755
index 8fb5841997..90bb913b6c
--- a/qa/L0_decoupled/test.sh
+++ b/qa/L0_decoupled/test.sh
@@ -74,7 +74,7 @@ for trial in $TRIALS; do
       cat $SERVER_LOG
       exit 1
   fi
-  
+
   for i in \
               test_one_to_none \
               test_one_to_one \
@@ -82,7 +82,7 @@ for trial in $TRIALS; do
               test_no_streaming \
               test_response_order \
 	      test_wrong_shape; do
-  
+
       echo "Test: $i" >>$CLIENT_LOG
       set +e
       python $DECOUPLED_TEST DecoupledTest.$i >>$CLIENT_LOG 2>&1
@@ -100,11 +100,11 @@ for trial in $TRIALS; do
       fi
       set -e
   done
-  
+
   # Will delay the writing of each response by the specified number of milliseconds.
   # This will ensure that there are multiple responses available to be written.
   export TRITONSERVER_DELAY_GRPC_RESPONSE=2000
-  
+
   echo "Test: test_one_to_multi_many" >>$CLIENT_LOG
   set +e
   python $DECOUPLED_TEST DecoupledTest.test_one_to_multi_many >>$CLIENT_LOG 2>&1
@@ -120,18 +120,18 @@ for trial in $TRIALS; do
           RET=1
       fi
   fi
-  
+
   set -e
-  
+
   unset TRITONSERVER_DELAY_GRPC_RESPONSE
-  
+
   kill $SERVER_PID
   wait $SERVER_PID
 done
 
 if [ $RET -eq 0 ]; then
   echo -e "\n***\n*** Test Passed\n***"
-else 
+else
   echo -e "\n***\n*** Test Failed\n***"
 fi
 
diff --git a/qa/L0_device_memory_tracker/test.py b/qa/L0_device_memory_tracker/test.py
new file mode 100755
index 0000000000..1d443d1032
--- /dev/null
+++ b/qa/L0_device_memory_tracker/test.py
@@ -0,0 +1,109 @@
+#!/usr/bin/env python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import time
+import unittest
+from functools import partial
+
+import nvidia_smi
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+
+
+class UnifiedClientProxy:
+    def __init__(self, client):
+        self.client_ = client
+
+    def __getattr__(self, attr):
+        forward_attr = getattr(self.client_, attr)
+        if type(self.client_) == grpcclient.InferenceServerClient:
+            if attr == "get_model_config":
+                return lambda *args, **kwargs: forward_attr(
+                    *args, **kwargs, as_json=True
+                )["config"]
+            elif attr == "get_inference_statistics":
+                return partial(forward_attr, as_json=True)
+        return forward_attr
+
+
+class MemoryUsageTest(unittest.TestCase):
+    def setUp(self):
+        nvidia_smi.nvmlInit()
+        self.gpu_handle_ = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
+        self.http_client_ = httpclient.InferenceServerClient(url="localhost:8000")
+        self.grpc_client_ = grpcclient.InferenceServerClient(url="localhost:8001")
+
+    def tearDown(self):
+        nvidia_smi.nvmlShutdown()
+
+    def report_used_gpu_memory(self):
+        info = nvidia_smi.nvmlDeviceGetMemoryInfo(self.gpu_handle_)
+        return info.used
+
+    def is_testing_backend(self, model_name, backend_name):
+        return self.client_.get_model_config(model_name)["backend"] == backend_name
+
+    def verify_recorded_usage(self, model_stat):
+        recorded_gpu_usage = 0
+        for usage in model_stat["memory_usage"]:
+            if usage["type"] == "GPU":
+                recorded_gpu_usage += int(usage["byte_size"])
+        # unload and verify recorded usage
+        before_total_usage = self.report_used_gpu_memory()
+        self.client_.unload_model(model_stat["name"])
+        # unload can return before the model is fully unloaded,
+        # so wait for the unload to finish
+        time.sleep(2)
+        usage_delta = before_total_usage - self.report_used_gpu_memory()
+        # check with tolerance as the GPU usage obtained is the overall device usage
+        self.assertTrue(
+            usage_delta * 0.9 <= recorded_gpu_usage <= usage_delta * 1.1,
+            msg="For model {}, expect recorded usage to be in range [{}, {}], got {}".format(
+                model_stat["name"],
+                usage_delta * 0.9,
+                usage_delta * 1.1,
+                recorded_gpu_usage,
+            ),
+        )
+
+    def test_onnx_http(self):
+        self.client_ = UnifiedClientProxy(self.http_client_)
+        model_stats = self.client_.get_inference_statistics()["model_stats"]
+        for model_stat in model_stats:
+            if self.is_testing_backend(model_stat["name"], "onnxruntime"):
+                self.verify_recorded_usage(model_stat)
+
+    def test_plan_grpc(self):
+        self.client_ = UnifiedClientProxy(self.grpc_client_)
+        model_stats = self.client_.get_inference_statistics()["model_stats"]
+        for model_stat in model_stats:
+            if self.is_testing_backend(model_stat["name"], "tensorrt"):
+                self.verify_recorded_usage(model_stat)
+
+
+if __name__ == "__main__":
+    unittest.main()
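
verify_recorded_usage above cross-checks Triton's per-model memory_usage statistics against the device-wide usage reported through NVML, allowing a ±10% tolerance because NVML only exposes overall GPU usage. Below is a minimal sketch of that NVML query and tolerance check using the same nvidia_smi bindings the accompanying test.sh installs; the byte counts in the usage example are made up for illustration.

# Minimal sketch of the NVML query and tolerance check used above.
import nvidia_smi


def used_gpu_memory(device_index=0):
    nvidia_smi.nvmlInit()
    try:
        handle = nvidia_smi.nvmlDeviceGetHandleByIndex(device_index)
        return nvidia_smi.nvmlDeviceGetMemoryInfo(handle).used
    finally:
        nvidia_smi.nvmlShutdown()


def within_tolerance(recorded_bytes, observed_delta, tolerance=0.1):
    # Accept the recorded usage if it falls within +/- tolerance of the observed
    # device-wide change, mirroring verify_recorded_usage above.
    lower = observed_delta * (1.0 - tolerance)
    upper = observed_delta * (1.0 + tolerance)
    return lower <= recorded_bytes <= upper


# Example: unloading a model freed ~512 MiB while statistics recorded 500 MiB.
print(within_tolerance(500 * 1024**2, 512 * 1024**2))  # True
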
diff --git a/qa/L0_device_memory_tracker/test.sh b/qa/L0_device_memory_tracker/test.sh
new file mode 100755
index 0000000000..7eb0d745da
--- /dev/null
+++ b/qa/L0_device_memory_tracker/test.sh
@@ -0,0 +1,128 @@
+#!/bin/bash
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+export CUDA_VISIBLE_DEVICES=0
+
+TEST_LOG="./test.log"
+TEST_PY=test.py
+
+DATADIR=/data/inferenceserver/${REPO_VERSION}
+rm -f *.log
+
+TEST_RESULT_FILE='test_results.txt'
+SERVER=/opt/tritonserver/bin/tritonserver
+SERVER_LOG="./server.log"
+
+source ../common/util.sh
+
+RET=0
+
+# Prepare the model repository; it only contains ONNX and TRT models as the
+# corresponding backends are known to support device memory tracking.
+rm -rf models && mkdir models
+# ONNX
+cp -r /data/inferenceserver/${REPO_VERSION}/onnx_model_store/* models/.
+rm -r models/*cpu
+
+# Convert the Caffe models to TRT PLAN models built against the current system
+CAFFE2PLAN=../common/caffe2plan
+set +e
+mkdir -p models/vgg19_plan/1 && rm -f models/vgg19_plan/1/model.plan && \
+    $CAFFE2PLAN -b32 -n prob -o models/vgg19_plan/1/model.plan \
+                $DATADIR/caffe_models/vgg19.prototxt $DATADIR/caffe_models/vgg19.caffemodel
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Failed to generate vgg19 PLAN\n***"
+    exit 1
+fi
+
+mkdir -p models/resnet50_plan/1 && rm -f models/resnet50_plan/1/model.plan && \
+    $CAFFE2PLAN -b32 -n prob -o models/resnet50_plan/1/model.plan \
+                $DATADIR/caffe_models/resnet50.prototxt $DATADIR/caffe_models/resnet50.caffemodel
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Failed to generate resnet50 PLAN\n***"
+    exit 1
+fi
+
+mkdir -p models/resnet152_plan/1 && rm -f models/resnet152_plan/1/model.plan && \
+    $CAFFE2PLAN -h -b32 -n prob -o models/resnet152_plan/1/model.plan \
+                $DATADIR/caffe_models/resnet152.prototxt $DATADIR/caffe_models/resnet152.caffemodel
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Failed to generate resnet152 PLAN\n***"
+    exit 1
+fi
+set -e
+
+# Set multiple instances on selected models to test instance-wise collection
+# and accumulation.
+echo "instance_group [{ count: 2; kind: KIND_GPU }]" >> models/resnet152_plan/config.pbtxt
+echo "instance_group [{ count: 2; kind: KIND_GPU }]" >> models/densenet/config.pbtxt
+
+# The test uses the nvidia-ml-py3 (nvidia_smi) Python bindings to validate the reported usage
+pip install nvidia-ml-py3
+
+# Start the server to load all models (in parallel), then gradually unload
+# the models and expect the memory usage changes to match what is reported
+# in the statistics.
+SERVER_ARGS="--backend-config=triton-backend-memory-tracker=true --model-repository=models --model-control-mode=explicit --load-model=*"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $TEST_PY > $TEST_LOG 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    cat $SERVER_LOG
+    cat $TEST_LOG
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+exit $RET
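
The memory_usage entries consumed by test.py come straight from the model statistics response; with the HTTP client that response is a plain dictionary, so summing the recorded GPU bytes for a single model looks roughly like the sketch below. The field names mirror how verify_recorded_usage reads them, and the helper name is hypothetical.

# Rough sketch of extracting a model's recorded GPU usage from the statistics response.
import tritonclient.http as httpclient


def recorded_gpu_bytes(model_name, url="localhost:8000"):
    client = httpclient.InferenceServerClient(url=url)
    model_stats = client.get_inference_statistics()["model_stats"]
    total = 0
    for model_stat in model_stats:
        if model_stat["name"] != model_name:
            continue
        # Sum only the GPU entries, matching verify_recorded_usage above.
        for usage in model_stat.get("memory_usage", []):
            if usage["type"] == "GPU":
                total += int(usage["byte_size"])
    return total
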
diff --git a/qa/L0_backend_python/unittest/test.sh b/qa/L0_dlpack_multi_gpu/test.sh
old mode 100644
new mode 100755
similarity index 79%
rename from qa/L0_backend_python/unittest/test.sh
rename to qa/L0_dlpack_multi_gpu/test.sh
index e78b2613b0..2485bfdb88
--- a/qa/L0_backend_python/unittest/test.sh
+++ b/qa/L0_dlpack_multi_gpu/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -27,27 +27,33 @@
 
 SERVER=/opt/tritonserver/bin/tritonserver
 SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1"
-CLIENT_PY=../python_unittest.py
+CLIENT_PY=./python_unittest.py
 CLIENT_LOG="./client.log"
 EXPECTED_NUM_TESTS="1"
 TEST_RESULT_FILE='test_results.txt'
 SERVER_LOG="./inference_server.log"
+export CUDA_VISIBLE_DEVICES=0,1,2,3
 
 RET=0
 rm -fr *.log ./models
 
-source ../../common/util.sh
+source ../common/util.sh
 
 # Uninstall the non CUDA version of PyTorch
 pip3 uninstall -y torch
-pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
+pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
 pip3 install tensorflow
 
+# Install CuPy for testing non_blocking compute streams
+pip3 install cupy-cuda12x
+
 rm -fr *.log ./models
 
 mkdir -p models/dlpack_test/1/
-cp ../../python_models/dlpack_test/model.py models/dlpack_test/1/
-cp ../../python_models/dlpack_test/config.pbtxt models/dlpack_test
+cp ../python_models/dlpack_test/model.py models/dlpack_test/1/
+cp ../python_models/dlpack_test/config.pbtxt models/dlpack_test
+cp ../L0_backend_python/python_unittest.py .
+sed -i 's#sys.path.append("../../common")#sys.path.append("../common")#g' python_unittest.py
 
 run_server
 if [ "$SERVER_PID" == "0" ]; then
@@ -58,7 +64,7 @@ fi
 
 set +e
 export MODEL_NAME="dlpack_test"
-python3 $CLIENT_PY > $CLIENT_LOG 2>&1 
+python3 $CLIENT_PY > $CLIENT_LOG 2>&1
 
 if [ $? -ne 0 ]; then
     echo -e "\n***\n*** python_unittest.py FAILED. \n***"
@@ -84,4 +90,5 @@ else
     echo -e "\n***\n*** Unittest test PASSED. \n***"
 fi
 
+export CUDA_VISIBLE_DEVICES=0
 exit $RET
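
The dlpack_test model exercised by this script passes GPU tensors between frameworks through DLPack, which is why both a CUDA build of PyTorch and CuPy are installed. The snippet below is a rough, Triton-independent illustration of that zero-copy handoff, assuming a CUDA-capable environment; it is not the dlpack_test model itself.

# Rough illustration of DLPack interchange between PyTorch and CuPy; the memory
# is shared, not copied, so changes on one side are visible to the other.
import cupy as cp
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

# Create a tensor on GPU 0 and hand it to CuPy without copying.
torch_tensor = torch.arange(4, dtype=torch.float32, device="cuda:0")
cupy_array = cp.from_dlpack(to_dlpack(torch_tensor))

# Mutating through CuPy is reflected in the original PyTorch tensor.
cupy_array *= 2
assert torch.equal(
    torch_tensor, torch.tensor([0.0, 2.0, 4.0, 6.0], device="cuda:0")
)

# The reverse direction works the same way via a DLPack capsule.
back_to_torch = from_dlpack(cupy_array.toDlpack())
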
diff --git a/qa/L0_doc_links/mkdocs.yml b/qa/L0_doc_links/mkdocs.yml
new file mode 100644
index 0000000000..1588680d92
--- /dev/null
+++ b/qa/L0_doc_links/mkdocs.yml
@@ -0,0 +1,44 @@
+# Copyright (c) 2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+site_name: CI Test
+use_directory_urls: False
+docs_dir: "./repos"
+plugins:
+        - htmlproofer
+        - search
diff --git a/qa/L0_doc_links/test.sh b/qa/L0_doc_links/test.sh
new file mode 100755
index 0000000000..be7d291b01
--- /dev/null
+++ b/qa/L0_doc_links/test.sh
@@ -0,0 +1,76 @@
+#!/bin/bash
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+LOG="`pwd`/log.txt"
+CONFIG="`pwd`/mkdocs.yml"
+RET=0
+# Download necessary packages
+python3 -m pip install mkdocs
+python3 -m pip install mkdocs-htmlproofer-plugin
+
+# Get the necessary repos
+mkdir repos && cd repos
+TRITON_BACKEND_REPO_TAG=${TRITON_BACKEND_REPO_TAG:="main"}
+echo ${TRITON_BACKEND_REPO_TAG}
+git clone --single-branch --depth=1 -b ${TRITON_BACKEND_REPO_TAG} https://github.com/triton-inference-server/backend.git
+cd ..
+
+exec mkdocs serve -f $CONFIG > $LOG &
+PID=$!
+# Time for the compilation to finish. This needs to be increased if other repos
+# are added to the test
+sleep 20
+
+until [[ (-z `pgrep mkdocs`) ]]; do
+    kill -2 $PID
+    sleep 2
+done
+
+if [[ ! -z `grep "invalid url" $LOG` ]]; then
+    cat $LOG
+    RET=1
+fi
+
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test PASSED\n***"
+else
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+# exit $RET
diff --git a/qa/L0_dyna_implicit_state/test.sh b/qa/L0_dyna_implicit_state/test.sh
old mode 100644
new mode 100755
index e09a24d493..0721d5cd32
--- a/qa/L0_dyna_implicit_state/test.sh
+++ b/qa/L0_dyna_implicit_state/test.sh
@@ -25,12 +25,25 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
 export ENSEMBLES=0
 BACKENDS=${BACKENDS:="onnx plan"}
 export BACKENDS
 export IMPLICIT_STATE=1
 
-(cd ../L0_dyna_sequence_batcher/ && bash -ex test.sh)
+(cd ../L0_dyna_sequence_batcher/ && bash -ex test.sh $REPO_VERSION)
 RET=$?
 
 if [ $RET == 0 ]; then
diff --git a/qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py b/qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py
old mode 100644
new mode 100755
index 6fff86948c..f2c709469b
--- a/qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py
+++ b/qa/L0_dyna_sequence_batcher/dyna_sequence_batcher_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -28,57 +30,55 @@
 
 sys.path.append("../common")
 
-from builtins import str
 import os
-import time
 import threading
+import time
 import unittest
+from builtins import str
+
 import numpy as np
-import test_util as tu
 import sequence_util as su
+import test_util as tu
 
-_test_system_shared_memory = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-_test_cuda_shared_memory = bool(
-    int(os.environ.get('TEST_CUDA_SHARED_MEMORY', 0)))
+_test_system_shared_memory = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+_test_cuda_shared_memory = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
 
-NO_BATCHING = (int(os.environ.get('NO_BATCHING', 0)) == 1)
+NO_BATCHING = int(os.environ.get("NO_BATCHING", 0)) == 1
 BACKENDS = os.environ.get(
-    'BACKENDS', "graphdef savedmodel libtorch onnx plan custom custom_string")
-IMPLICIT_STATE = (int(os.environ['IMPLICIT_STATE']) == 1)
+    "BACKENDS", "graphdef savedmodel libtorch onnx plan custom custom_string"
+)
+IMPLICIT_STATE = int(os.environ["IMPLICIT_STATE"]) == 1
 
-_trials = BACKENDS.split(' ')
+_trials = BACKENDS.split(" ")
 for backend in BACKENDS.split(" "):
     if NO_BATCHING:
-        if (backend != 'custom') and (backend != 'custom_string'):
+        if (backend != "custom") and (backend != "custom_string"):
             _trials += (backend + "_nobatch",)
 
 _ragged_batch_supported_trials = []
-if 'custom' in BACKENDS.split(' '):
-    _ragged_batch_supported_trials.append('custom')
+if "custom" in BACKENDS.split(" "):
+    _ragged_batch_supported_trials.append("custom")
 
 _protocols = ("http", "grpc")
 _max_sequence_idle_ms = 5000
 
 
 class DynaSequenceBatcherTest(su.SequenceBatcherTestUtil):
-
     def get_datatype(self, trial):
         return np.int32
 
-    def get_expected_result(self,
-                            expected_result,
-                            corrid,
-                            value,
-                            trial,
-                            flag_str=None):
+    def get_expected_result(self, expected_result, corrid, value, trial, flag_str=None):
         # Adjust the expected_result for models that
-        # couldn't implement the full accumulator. See
+        # could not implement the full accumulator. See
         # qa/common/gen_qa_dyna_sequence_models.py for more
         # information.
-        if ((("nobatch" not in trial) and ("custom" not in trial)) or \
-            ("graphdef" in trial) or ("plan" in trial) or ("onnx" in trial) or \
-            ("libtorch" in trial)):
+        if (
+            (("nobatch" not in trial) and ("custom" not in trial))
+            or ("graphdef" in trial)
+            or ("plan" in trial)
+            or ("onnx" in trial)
+            or ("libtorch" in trial)
+        ):
             expected_result = value
             if flag_str is not None:
                 if "start" in flag_str:
@@ -90,12 +90,9 @@ def get_expected_result(self,
                         expected_result += corrid
         return expected_result
 
-    def get_expected_result_implicit(self,
-                                     expected_result,
-                                     corrid,
-                                     value,
-                                     trial,
-                                     flag_str=None):
+    def get_expected_result_implicit(
+        self, expected_result, corrid, value, trial, flag_str=None
+    ):
         return expected_result
 
     def test_simple_sequence(self):
@@ -111,18 +108,22 @@ def test_simple_sequence(self):
 
                     self.check_setup(model_name)
                     self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                    self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                     os.environ)
+                    self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                     if "string" in trial:
-                        corrid = '52'
+                        corrid = "52"
                     else:
                         corrid = 52
 
-                    expected_result = self.get_expected_result(
-                        45 + int(corrid), corrid, 9, trial, "end"
-                    ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                        45, corrid, 9, trial, "end")
+                    expected_result = (
+                        self.get_expected_result(
+                            45 + int(corrid), corrid, 9, trial, "end"
+                        )
+                        if not IMPLICIT_STATE
+                        else self.get_expected_result_implicit(
+                            45, corrid, 9, trial, "end"
+                        )
+                    )
 
                     self.check_sequence(
                         trial,
@@ -131,19 +132,26 @@ def test_simple_sequence(self):
                         corrid,
                         (4000, None),
                         # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay))
-                        (("start", 1, None, None), (None, 2, None, None),
-                         (None, 3, None, None), (None, 4, None, None),
-                         (None, 5, None, None), (None, 6, None, None),
-                         (None, 7, None, None), (None, 8, None, None),
-                         ("end", 9, None, None)),
+                        (
+                            ("start", 1, None, None),
+                            (None, 2, None, None),
+                            (None, 3, None, None),
+                            (None, 4, None, None),
+                            (None, 5, None, None),
+                            (None, 6, None, None),
+                            (None, 7, None, None),
+                            (None, 8, None, None),
+                            ("end", 9, None, None),
+                        ),
                         expected_result,
                         protocol,
-                        sequence_name="{}_{}".format(self._testMethodName,
-                                                     protocol))
+                        sequence_name="{}_{}".format(self._testMethodName, protocol),
+                    )
 
                     self.check_deferred_exception()
-                    self.check_status(model_name, {1: 9 * (idx + 1)},
-                                      9 * (idx + 1), 9 * (idx + 1))
+                    self.check_status(
+                        model_name, {1: 9 * (idx + 1)}, 9 * (idx + 1), 9 * (idx + 1)
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -160,18 +168,22 @@ def test_length1_sequence(self):
 
                     self.check_setup(model_name)
                     self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                    self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                     os.environ)
+                    self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                     if "string" in trial:
-                        corrid = '99'
+                        corrid = "99"
                     else:
                         corrid = 99
 
-                    expected_result = self.get_expected_result(
-                        42 + int(corrid), corrid, 42, trial, "start,end"
-                    ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                        42, corrid, 42, trial, "start,end")
+                    expected_result = (
+                        self.get_expected_result(
+                            42 + int(corrid), corrid, 42, trial, "start,end"
+                        )
+                        if not IMPLICIT_STATE
+                        else self.get_expected_result_implicit(
+                            42, corrid, 42, trial, "start,end"
+                        )
+                    )
 
                     self.check_sequence(
                         trial,
@@ -180,50 +192,60 @@ def test_length1_sequence(self):
                         corrid,
                         (4000, None),
                         # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay))
-                        (
-                            ("start,end", 42, None, None),),
+                        (("start,end", 42, None, None),),
                         expected_result,
                         protocol,
-                        sequence_name="{}_{}".format(self._testMethodName,
-                                                     protocol))
+                        sequence_name="{}_{}".format(self._testMethodName, protocol),
+                    )
 
                     self.check_deferred_exception()
-                    self.check_status(model_name, {1: (idx + 1)}, (idx + 1),
-                                      (idx + 1))
+                    self.check_status(model_name, {1: (idx + 1)}, (idx + 1), (idx + 1))
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
-    def _multi_sequence_impl(self, trials, expected_batch_exec,
-                             expected_exec_cnt, sleep_secs, tensor_shapes):
+    def _multi_sequence_impl(
+        self, trials, expected_batch_exec, expected_exec_cnt, sleep_secs, tensor_shapes
+    ):
         for trial in trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
             precreated_shm0_handles = self.precreate_register_regions(
-                (1, 3), dtype, 0, tensor_shape=(tensor_shapes[0],))
+                (1, 3), dtype, 0, tensor_shape=(tensor_shapes[0],)
+            )
             precreated_shm1_handles = self.precreate_register_regions(
-                (11, 12, 13), dtype, 1, tensor_shape=(tensor_shapes[1],))
+                (11, 12, 13), dtype, 1, tensor_shape=(tensor_shapes[1],)
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 112, 113), dtype, 2, tensor_shape=(tensor_shapes[2],))
+                (111, 112, 113), dtype, 2, tensor_shape=(tensor_shapes[2],)
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1113), dtype, 3, tensor_shape=(tensor_shapes[3],))
+                (1111, 1112, 1113), dtype, 3, tensor_shape=(tensor_shapes[3],)
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004']
+                    corrids = ["1001", "1002", "1003", "1004"]
                 else:
                     corrids = [1001, 1002, 1003, 1004]
 
-                expected_result = self.get_expected_result(
-                    4 * tensor_shapes[0] +
-                    int(corrids[0]), corrids[0], 3, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    4, corrids[0], 3, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        4 * tensor_shapes[0] + int(corrids[0]),
+                        corrids[0],
+                        3,
+                        trial,
+                        "end",
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        4, corrids[0], 3, trial, "end"
+                    )
+                )
 
                 threads = []
                 threads.append(
@@ -238,19 +260,30 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                             # (flag_str, value, pre_delay_ms)
                             (("start", 1, None), ("end", 3, None)),
                             expected_result,
-                            precreated_shm0_handles),
+                            precreated_shm0_handles,
+                        ),
                         kwargs={
-                            'sequence_name':
-                                "{}_{}".format(self._testMethodName,
-                                               corrids[0]),
-                            'tensor_shape': (tensor_shapes[0],)
-                        }))
+                            "sequence_name": "{}_{}".format(
+                                self._testMethodName, corrids[0]
+                            ),
+                            "tensor_shape": (tensor_shapes[0],),
+                        },
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    36 * tensor_shapes[1] +
-                    int(corrids[1]), corrids[1], 13, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    36, corrids[1], 13, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        36 * tensor_shapes[1] + int(corrids[1]),
+                        corrids[1],
+                        13,
+                        trial,
+                        "end",
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        36, corrids[1], 13, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -261,22 +294,32 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                             corrids[1],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 11, None), (None, 12, None), ("end", 13,
-                                                                     None)),
+                            (("start", 11, None), (None, 12, None), ("end", 13, None)),
                             expected_result,
-                            precreated_shm1_handles),
+                            precreated_shm1_handles,
+                        ),
                         kwargs={
-                            'sequence_name':
-                                "{}_{}".format(self._testMethodName,
-                                               corrids[1]),
-                            'tensor_shape': (tensor_shapes[1],)
-                        }))
+                            "sequence_name": "{}_{}".format(
+                                self._testMethodName, corrids[1]
+                            ),
+                            "tensor_shape": (tensor_shapes[1],),
+                        },
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    336 * tensor_shapes[2] +
-                    int(corrids[2]), corrids[2], 113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    336, corrids[2], 113, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        336 * tensor_shapes[2] + int(corrids[2]),
+                        corrids[2],
+                        113,
+                        trial,
+                        "end",
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        336, corrids[2], 113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -287,21 +330,35 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                             corrids[2],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 111, None), (None, 112, None),
-                             ("end", 113, None)),
+                            (
+                                ("start", 111, None),
+                                (None, 112, None),
+                                ("end", 113, None),
+                            ),
                             expected_result,
-                            precreated_shm2_handles),
+                            precreated_shm2_handles,
+                        ),
                         kwargs={
-                            'sequence_name':
-                                "{}_{}".format(self._testMethodName,
-                                               corrids[2]),
-                            'tensor_shape': (tensor_shapes[2],)
-                        }))
-                expected_result = self.get_expected_result(
-                    3336 * tensor_shapes[3] +
-                    int(corrids[3]), corrids[3], 1113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    3336, corrids[3], 1113, trial, "end")
+                            "sequence_name": "{}_{}".format(
+                                self._testMethodName, corrids[2]
+                            ),
+                            "tensor_shape": (tensor_shapes[2],),
+                        },
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        3336 * tensor_shapes[3] + int(corrids[3]),
+                        corrids[3],
+                        1113,
+                        trial,
+                        "end",
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        3336, corrids[3], 1113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -312,16 +369,22 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112, None),
-                             ("end", 1113, None)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, None),
+                                ("end", 1113, None),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
+                            precreated_shm3_handles,
+                        ),
                         kwargs={
-                            'sequence_name':
-                                "{}_{}".format(self._testMethodName,
-                                               corrids[3]),
-                            'tensor_shape': (tensor_shapes[3],)
-                        }))
+                            "sequence_name": "{}_{}".format(
+                                self._testMethodName, corrids[3]
+                            ),
+                            "tensor_shape": (tensor_shapes[3],),
+                        },
+                    )
+                )
 
                 for t in threads:
                     t.start()
@@ -330,8 +393,9 @@ def _multi_sequence_impl(self, trials, expected_batch_exec,
                 for t in threads:
                     t.join()
                 self.check_deferred_exception()
-                self.check_status(model_name, expected_batch_exec,
-                                  expected_exec_cnt, 11)
+                self.check_status(
+                    model_name, expected_batch_exec, expected_exec_cnt, 11
+                )
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
             finally:
@@ -355,18 +419,18 @@ def test_multi_sequence_different_shape(self):
         # Send four sequences in parallel where the requests in each
         # sequence have different shape. Sequences should not be
         # batched due to input tensor size differences.
-        self._multi_sequence_impl(_ragged_batch_supported_trials, {1: 11}, 11,
-                                  0, (4, 3, 1, 2))
+        self._multi_sequence_impl(
+            _ragged_batch_supported_trials, {1: 11}, 11, 0, (4, 3, 1, 2)
+        )
 
     def test_multi_sequence_different_shape_allow_ragged(self):
         # Send four sequences in parallel where the requests in each
         # sequence have different shape. Input is marked as allowing
         # ragged and so sequences should be batched even with input
         # tensor size differences.
-        self._multi_sequence_impl(_ragged_batch_supported_trials, {
-            4: 2,
-            3: 1
-        }, 3, 1, (4, 3, 1, 2))
+        self._multi_sequence_impl(
+            _ragged_batch_supported_trials, {4: 2, 3: 1}, 3, 1, (4, 3, 1, 2)
+        )
 
     def test_backlog(self):
         # Send 5 equal-length sequences in parallel and make sure they
@@ -376,33 +440,42 @@ def test_backlog(self):
         for trial in _trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
-            precreated_shm0_handles = self.precreate_register_regions((1, 2, 3),
-                                                                      dtype, 0)
+            precreated_shm0_handles = self.precreate_register_regions(
+                (1, 2, 3), dtype, 0
+            )
             precreated_shm1_handles = self.precreate_register_regions(
-                (11, 12, 13), dtype, 1)
+                (11, 12, 13), dtype, 1
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 112, 113), dtype, 2)
+                (111, 112, 113), dtype, 2
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1113), dtype, 3)
+                (1111, 1112, 1113), dtype, 3
+            )
             precreated_shm4_handles = self.precreate_register_regions(
-                (11111, 11112, 11113), dtype, 4)
+                (11111, 11112, 11113), dtype, 4
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004', '1005']
+                    corrids = ["1001", "1002", "1003", "1004", "1005"]
                 else:
                     corrids = [1001, 1002, 1003, 1004, 1005]
 
-                expected_result = self.get_expected_result(
-                    6 + int(corrids[0]), corrids[0], 3, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    6, corrids[0], 3, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        6 + int(corrids[0]), corrids[0], 3, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        6, corrids[0], 3, trial, "end"
+                    )
+                )
 
                 threads = []
                 threads.append(
@@ -415,18 +488,23 @@ def test_backlog(self):
                             corrids[0],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1, None), (None, 2, None), ("end", 3,
-                                                                   None)),
+                            (("start", 1, None), (None, 2, None), ("end", 3, None)),
                             expected_result,
-                            precreated_shm0_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm0_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    36 + int(corrids[1]), corrids[1], 13, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    36, corrids[1], 13, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        36 + int(corrids[1]), corrids[1], 13, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        36, corrids[1], 13, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -437,18 +515,23 @@ def test_backlog(self):
                             corrids[1],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 11, None), (None, 12, None), ("end", 13,
-                                                                     None)),
+                            (("start", 11, None), (None, 12, None), ("end", 13, None)),
                             expected_result,
-                            precreated_shm1_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm1_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    336 + int(corrids[2]), corrids[2], 113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    336, corrids[2], 113, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        336 + int(corrids[2]), corrids[2], 113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        336, corrids[2], 113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -459,18 +542,27 @@ def test_backlog(self):
                             corrids[2],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 111, None), (None, 112, None),
-                             ("end", 113, None)),
+                            (
+                                ("start", 111, None),
+                                (None, 112, None),
+                                ("end", 113, None),
+                            ),
                             expected_result,
-                            precreated_shm2_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm2_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    3336, corrids[3], 1113, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        3336, corrids[3], 1113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -481,18 +573,27 @@ def test_backlog(self):
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112, None),
-                             ("end", 1113, None)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, None),
+                                ("end", 1113, None),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm3_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
-                expected_result = self.get_expected_result(
-                    33336 + int(corrids[4]), corrids[4], 11113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    33336, corrids[4], 11113, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        33336 + int(corrids[4]), corrids[4], 11113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        33336, corrids[4], 11113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -503,13 +604,17 @@ def test_backlog(self):
                             corrids[4],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 11111, None), (None, 11112, None),
-                             ("end", 11113, None)),
+                            (
+                                ("start", 11111, None),
+                                (None, 11112, None),
+                                ("end", 11113, None),
+                            ),
                             expected_result,
-                            precreated_shm4_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm4_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
                 for t in threads:
                     t.start()
@@ -534,35 +639,45 @@ def test_backlog_fill(self):
         for trial in _trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
-            precreated_shm0_handles = self.precreate_register_regions((1, 2, 3),
-                                                                      dtype, 0)
-            precreated_shm1_handles = self.precreate_register_regions((11, 13),
-                                                                      dtype, 1)
+            precreated_shm0_handles = self.precreate_register_regions(
+                (1, 2, 3), dtype, 0
+            )
+            precreated_shm1_handles = self.precreate_register_regions(
+                (11, 13), dtype, 1
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 113), dtype, 2)
+                (111, 113), dtype, 2
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1113), dtype, 3)
-            precreated_shm4_handles = self.precreate_register_regions((11111,),
-                                                                      dtype, 4)
-            precreated_shm5_handles = self.precreate_register_regions((22222,),
-                                                                      dtype, 5)
+                (1111, 1112, 1113), dtype, 3
+            )
+            precreated_shm4_handles = self.precreate_register_regions(
+                (11111,), dtype, 4
+            )
+            precreated_shm5_handles = self.precreate_register_regions(
+                (22222,), dtype, 5
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004', '1005', '1006']
+                    corrids = ["1001", "1002", "1003", "1004", "1005", "1006"]
                 else:
                     corrids = [1001, 1002, 1003, 1004, 1005, 1006]
                 threads = []
 
-                expected_result = self.get_expected_result(
-                    6 + int(corrids[0]), corrids[0], 3, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    6, corrids[0], 3, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        6 + int(corrids[0]), corrids[0], 3, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        6, corrids[0], 3, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -573,17 +688,22 @@ def test_backlog_fill(self):
                             corrids[0],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1, None), (None, 2, None), ("end", 3,
-                                                                   None)),
+                            (("start", 1, None), (None, 2, None), ("end", 3, None)),
                             expected_result,
-                            precreated_shm0_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    24 + int(corrids[1]), corrids[1], 13, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    24, corrids[1], 13, trial, "end")
+                            precreated_shm0_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        24 + int(corrids[1]), corrids[1], 13, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        24, corrids[1], 13, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -596,14 +716,20 @@ def test_backlog_fill(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 11, None), ("end", 13, None)),
                             expected_result,
-                            precreated_shm1_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    224 + int(corrids[2]), corrids[2], 113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    224, corrids[2], 113, trial, "end")
+                            precreated_shm1_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        224 + int(corrids[2]), corrids[2], 113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        224, corrids[2], 113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -616,14 +742,20 @@ def test_backlog_fill(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 111, None), ("end", 113, None)),
                             expected_result,
-                            precreated_shm2_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    3336, corrids[3], 1113, trial, "end")
+                            precreated_shm2_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        3336, corrids[3], 1113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -634,18 +766,26 @@ def test_backlog_fill(self):
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112, 3000),
-                             ("end", 1113, None)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, 3000),
+                                ("end", 1113, None),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    11111 +
-                    int(corrids[4]), corrids[4], 11111, trial, "start,end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    11111, corrids[4], 11111, trial, "start,end")
+                            precreated_shm3_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        11111 + int(corrids[4]), corrids[4], 11111, trial, "start,end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        11111, corrids[4], 11111, trial, "start,end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -656,18 +796,22 @@ def test_backlog_fill(self):
                             corrids[4],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (
-                                ("start,end", 11111, None),),
+                            (("start,end", 11111, None),),
                             expected_result,
-                            precreated_shm4_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    22222 +
-                    int(corrids[5]), corrids[5], 22222, trial, "start,end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    22222, corrids[5], 22222, trial, "start,end")
+                            precreated_shm4_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        22222 + int(corrids[5]), corrids[5], 22222, trial, "start,end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        22222, corrids[5], 22222, trial, "start,end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -678,13 +822,13 @@ def test_backlog_fill(self):
                             corrids[5],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (
-                                ("start,end", 22222, None),),
+                            (("start,end", 22222, None),),
                             expected_result,
-                            precreated_shm5_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm5_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
                 threads[0].start()
                 threads[1].start()
@@ -716,35 +860,45 @@ def test_backlog_fill_no_end(self):
         for trial in _trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
-            precreated_shm0_handles = self.precreate_register_regions((1, 2, 3),
-                                                                      dtype, 0)
-            precreated_shm1_handles = self.precreate_register_regions((11, 13),
-                                                                      dtype, 1)
+            precreated_shm0_handles = self.precreate_register_regions(
+                (1, 2, 3), dtype, 0
+            )
+            precreated_shm1_handles = self.precreate_register_regions(
+                (11, 13), dtype, 1
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 113), dtype, 2)
+                (111, 113), dtype, 2
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1113), dtype, 3)
-            precreated_shm4_handles = self.precreate_register_regions((11111,),
-                                                                      dtype, 4)
+                (1111, 1112, 1113), dtype, 3
+            )
+            precreated_shm4_handles = self.precreate_register_regions(
+                (11111,), dtype, 4
+            )
             precreated_shm5_handles = self.precreate_register_regions(
-                (22222, 22223, 22224), dtype, 5)
+                (22222, 22223, 22224), dtype, 5
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004', '1005', '1006']
+                    corrids = ["1001", "1002", "1003", "1004", "1005", "1006"]
                 else:
                     corrids = [1001, 1002, 1003, 1004, 1005, 1006]
                 threads = []
-                expected_result = self.get_expected_result(
-                    6 + int(corrids[0]), corrids[0], 3, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    6, corrids[0], 3, trial, "end")
+                expected_result = (
+                    self.get_expected_result(
+                        6 + int(corrids[0]), corrids[0], 3, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        6, corrids[0], 3, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -755,17 +909,22 @@ def test_backlog_fill_no_end(self):
                             corrids[0],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1, None), (None, 2, None), ("end", 3,
-                                                                   None)),
+                            (("start", 1, None), (None, 2, None), ("end", 3, None)),
                             expected_result,
-                            precreated_shm0_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    24 + int(corrids[1]), corrids[1], 13, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    24, corrids[1], 13, trial, "end")
+                            precreated_shm0_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        24 + int(corrids[1]), corrids[1], 13, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        24, corrids[1], 13, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -778,14 +937,20 @@ def test_backlog_fill_no_end(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 11, None), ("end", 13, None)),
                             expected_result,
-                            precreated_shm1_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    224 + int(corrids[2]), corrids[2], 113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    224, corrids[2], 113, trial, "end")
+                            precreated_shm1_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        224 + int(corrids[2]), corrids[2], 113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        224, corrids[2], 113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -798,14 +963,20 @@ def test_backlog_fill_no_end(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 111, None), ("end", 113, None)),
                             expected_result,
-                            precreated_shm2_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    3336, corrids[3], 1113, trial, "end")
+                            precreated_shm2_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        3336 + int(corrids[3]), corrids[3], 1113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        3336, corrids[3], 1113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -816,18 +987,26 @@ def test_backlog_fill_no_end(self):
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112, 3000),
-                             ("end", 1113, None)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, 3000),
+                                ("end", 1113, None),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    11111 +
-                    int(corrids[4]), corrids[4], 11111, trial, "start,end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    11111, corrids[4], 11111, trial, "start,end")
+                            precreated_shm3_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        11111 + int(corrids[4]), corrids[4], 11111, trial, "start,end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        11111, corrids[4], 11111, trial, "start,end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -838,17 +1017,22 @@ def test_backlog_fill_no_end(self):
                             corrids[4],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (
-                                ("start,end", 11111, None),),
+                            (("start,end", 11111, None),),
                             expected_result,
-                            precreated_shm4_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    66669 + int(corrids[5]), corrids[5], 22224, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    66669, corrids[5], 22224, trial, "end")
+                            precreated_shm4_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        66669 + int(corrids[5]), corrids[5], 22224, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        66669, corrids[5], 22224, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -865,10 +1049,11 @@ def test_backlog_fill_no_end(self):
                                 ("end", 22224, 2000),
                             ),
                             expected_result,
-                            precreated_shm5_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm5_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
                 threads[0].start()
                 threads[1].start()
@@ -906,33 +1091,40 @@ def test_backlog_sequence_timeout(self):
         for trial in _trials:
             self.clear_deferred_exceptions()
             dtype = self.get_datatype(trial)
-            precreated_shm0_handles = self.precreate_register_regions((1, 3),
-                                                                      dtype, 0)
+            precreated_shm0_handles = self.precreate_register_regions((1, 3), dtype, 0)
             precreated_shm1_handles = self.precreate_register_regions(
-                (11, 12, 12, 13), dtype, 1)
+                (11, 12, 12, 13), dtype, 1
+            )
             precreated_shm2_handles = self.precreate_register_regions(
-                (111, 112, 112, 113), dtype, 2)
+                (111, 112, 112, 113), dtype, 2
+            )
             precreated_shm3_handles = self.precreate_register_regions(
-                (1111, 1112, 1112, 1113), dtype, 3)
+                (1111, 1112, 1112, 1113), dtype, 3
+            )
             precreated_shm4_handles = self.precreate_register_regions(
-                (11111, 11113), dtype, 4)
+                (11111, 11113), dtype, 4
+            )
             try:
                 model_name = tu.get_dyna_sequence_model_name(trial, dtype)
 
                 self.check_setup(model_name)
                 self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ)
-                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER",
-                                 os.environ)
+                self.assertNotIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ)
 
                 if "string" in trial:
-                    corrids = ['1001', '1002', '1003', '1004', '1005']
+                    corrids = ["1001", "1002", "1003", "1004", "1005"]
                 else:
                     corrids = [1001, 1002, 1003, 1004, 1005]
                 threads = []
-                expected_result = self.get_expected_result(
-                    4 + int(corrids[0]), corrids[0], 3, trial, None
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    4, corrids[0], 3, trial, None)
+                expected_result = (
+                    self.get_expected_result(
+                        4 + int(corrids[0]), corrids[0], 3, trial, None
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        4, corrids[0], 3, trial, None
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -943,17 +1135,25 @@ def test_backlog_sequence_timeout(self):
                             corrids[0],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1, None),
-                             (None, 3, _max_sequence_idle_ms + 1000)),
+                            (
+                                ("start", 1, None),
+                                (None, 3, _max_sequence_idle_ms + 1000),
+                            ),
                             expected_result,
-                            precreated_shm0_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    48 + int(corrids[1]), corrids[1], 13, trial, None
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    48, corrids[1], 13, trial, None)
+                            precreated_shm0_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        48 + int(corrids[1]), corrids[1], 13, trial, None
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        48, corrids[1], 13, trial, None
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -964,19 +1164,27 @@ def test_backlog_sequence_timeout(self):
                             corrids[1],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 11, None), (None, 12,
-                                                   _max_sequence_idle_ms / 2),
-                             (None, 12, _max_sequence_idle_ms / 2),
-                             ("end", 13, _max_sequence_idle_ms / 2)),
+                            (
+                                ("start", 11, None),
+                                (None, 12, _max_sequence_idle_ms / 2),
+                                (None, 12, _max_sequence_idle_ms / 2),
+                                ("end", 13, _max_sequence_idle_ms / 2),
+                            ),
                             expected_result,
-                            precreated_shm1_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    448 + int(corrids[2]), corrids[2], 113, trial, None
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    448, corrids[2], 113, trial, None)
+                            precreated_shm1_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        448 + int(corrids[2]), corrids[2], 113, trial, None
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        448, corrids[2], 113, trial, None
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -987,19 +1195,27 @@ def test_backlog_sequence_timeout(self):
                             corrids[2],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 111, None), (None, 112,
-                                                    _max_sequence_idle_ms / 2),
-                             (None, 112, _max_sequence_idle_ms / 2),
-                             ("end", 113, _max_sequence_idle_ms / 2)),
+                            (
+                                ("start", 111, None),
+                                (None, 112, _max_sequence_idle_ms / 2),
+                                (None, 112, _max_sequence_idle_ms / 2),
+                                ("end", 113, _max_sequence_idle_ms / 2),
+                            ),
                             expected_result,
-                            precreated_shm2_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    4448 + int(corrids[3]), corrids[3], 1113, trial, None
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    4448, corrids[3], 1113, trial, None)
+                            precreated_shm2_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        4448 + int(corrids[3]), corrids[3], 1113, trial, None
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        4448, corrids[3], 1113, trial, None
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -1010,19 +1226,27 @@ def test_backlog_sequence_timeout(self):
                             corrids[3],
                             (None, None),
                             # (flag_str, value, pre_delay_ms)
-                            (("start", 1111, None), (None, 1112,
-                                                     _max_sequence_idle_ms / 2),
-                             (None, 1112, _max_sequence_idle_ms / 2),
-                             ("end", 1113, _max_sequence_idle_ms / 2)),
+                            (
+                                ("start", 1111, None),
+                                (None, 1112, _max_sequence_idle_ms / 2),
+                                (None, 1112, _max_sequence_idle_ms / 2),
+                                ("end", 1113, _max_sequence_idle_ms / 2),
+                            ),
                             expected_result,
-                            precreated_shm3_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
-                expected_result = self.get_expected_result(
-                    22224 + int(corrids[4]), corrids[4], 11113, trial, "end"
-                ) if not IMPLICIT_STATE else self.get_expected_result_implicit(
-                    22224, corrids[4], 11113, trial, "end")
+                            precreated_shm3_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
+                expected_result = (
+                    self.get_expected_result(
+                        22224 + int(corrids[4]), corrids[4], 11113, trial, "end"
+                    )
+                    if not IMPLICIT_STATE
+                    else self.get_expected_result_implicit(
+                        22224, corrids[4], 11113, trial, "end"
+                    )
+                )
                 threads.append(
                     threading.Thread(
                         target=self.check_sequence_async,
@@ -1035,10 +1259,11 @@ def test_backlog_sequence_timeout(self):
                             # (flag_str, value, pre_delay_ms)
                             (("start", 11111, None), ("end", 11113, None)),
                             expected_result,
-                            precreated_shm4_handles),
-                        kwargs={
-                            'sequence_name': "{}".format(self._testMethodName)
-                        }))
+                            precreated_shm4_handles,
+                        ),
+                        kwargs={"sequence_name": "{}".format(self._testMethodName)},
+                    )
+                )
 
                 threads[0].start()
                 threads[1].start()
@@ -1052,10 +1277,15 @@ def test_backlog_sequence_timeout(self):
                 self.check_deferred_exception()
                 self.assertTrue(False, "expected error")
             except Exception as ex:
-                self.assertTrue(ex.message().startswith(
-                    str("inference request for sequence 1001 to " +
-                        "model '{}' must specify the START flag on the first " +
-                        "request of the sequence").format(model_name)))
+                self.assertTrue(
+                    ex.message().startswith(
+                        str(
+                            "inference request for sequence 1001 to "
+                            + "model '{}' must specify the START flag on the first "
+                            + "request of the sequence"
+                        ).format(model_name)
+                    )
+                )
             finally:
                 if _test_system_shared_memory or _test_cuda_shared_memory:
                     self.cleanup_shm_regions(precreated_shm0_handles)
@@ -1065,5 +1295,5 @@ def test_backlog_sequence_timeout(self):
                     self.cleanup_shm_regions(precreated_shm4_handles)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_dyna_sequence_batcher/test.sh b/qa/L0_dyna_sequence_batcher/test.sh
index 8c62e3f630..acac8399af 100755
--- a/qa/L0_dyna_sequence_batcher/test.sh
+++ b/qa/L0_dyna_sequence_batcher/test.sh
@@ -65,15 +65,20 @@ fi
 
 RET=0
 
-rm -fr *.log *.serverlog
+rm -fr *.log
 
 # models
 rm -fr models && mkdir models
-cp -r ${DATADIR}/$MODEL_REPOSITORY/* models/.
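+# Copy each model and set its instance count to 1 ("count: 1" is inserted
+# after "kind: KIND_CPU" in config.pbtxt).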
+for MODEL in ${DATADIR}/$MODEL_REPOSITORY/* ; do
+    cp -r $MODEL models/. && \
+        (cd models/$(basename $MODEL) && \
+            sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt)
+done
 
 # Implicit state models for custom backend do not exist.
 if [ $IMPLICIT_STATE == "0" ]; then
     cp -r ../custom_models/custom_dyna_sequence_int32 models/.
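+    # Also set the instance count to 1 for the copied custom backend model.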
+    sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" models/custom_dyna_sequence_int32/config.pbtxt
     # Construct custom dyna_sequence_model with STRING sequence ID. Copy model and edit config.pbtxt
     cp -r models/custom_dyna_sequence_int32 models/custom_string_dyna_sequence_int32
     sed -i "s/custom_dyna_sequence_int32/custom_string_dyna_sequence_int32/g" models/custom_string_dyna_sequence_int32/config.pbtxt
@@ -86,6 +91,7 @@ if [ $IMPLICIT_STATE == "0" ]; then
     rm -fr ragged_models && mkdir ragged_models
     cp -r ../custom_models/custom_dyna_sequence_int32 ragged_models/.
     (cd ragged_models/custom_dyna_sequence_int32 && \
+            sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt && \
             sed -i "s/name:.*\"INPUT\"/name: \"INPUT\"\\nallow_ragged_batch: true/" config.pbtxt)
 fi
 
@@ -98,7 +104,7 @@ for i in \
         test_simple_sequence \
         test_length1_sequence \
          ; do
-    SERVER_LOG="./$i.serverlog"
+    SERVER_LOG="./$i.server.log"
     SERVER_ARGS="--model-repository=`pwd`/models"
     run_server
     if [ "$SERVER_PID" == "0" ]; then
@@ -141,7 +147,7 @@ for i in \
         test_backlog_sequence_timeout \
     ; do
 
-    SERVER_LOG="./$i.serverlog"
+    SERVER_LOG="./$i.server.log"
     SERVER_ARGS="--model-repository=`pwd`/models"
     run_server
     if [ "$SERVER_PID" == "0" ]; then
@@ -180,7 +186,7 @@ if [ $IMPLICIT_STATE == "0" ]; then
         test_multi_sequence_different_shape_allow_ragged \
         ; do
 
-        SERVER_LOG="./$i.serverlog"
+        SERVER_LOG="./$i.server.log"
         SERVER_ARGS="--model-repository=`pwd`/ragged_models"
         run_server
         if [ "$SERVER_PID" == "0" ]; then
diff --git a/qa/L0_grpc/client_plugin_models/client_plugin_test/1/model.py b/qa/L0_grpc/client_plugin_models/client_plugin_test/1/model.py
new file mode 100644
index 0000000000..17c406b18e
--- /dev/null
+++ b/qa/L0_grpc/client_plugin_models/client_plugin_test/1/model.py
@@ -0,0 +1,63 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import json
+
+import numpy as np
+import triton_python_backend_utils as pb_utils
+
+
+class TritonPythonModel:
+    def execute(self, requests):
+        responses = []
+
+        for request in requests:
+            json_string = (
+                pb_utils.get_input_tensor_by_name(request, "EXPECTED_HEADERS")
+                .as_numpy()[0]
+                .decode("utf-8")
+            )
+            expected_headers = json.loads(json_string)
+
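+            # Headers forwarded by the gRPC frontend are exposed to the model
+            # as request parameters; verify that every expected header arrived
+            # with the expected value.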
+            success = True
+            if request.parameters() != "":
+                parameters = json.loads(request.parameters())
+                for key, value in expected_headers.items():
+                    if key in parameters:
+                        if parameters[key] != value:
+                            success = False
+                    else:
+                        success = False
+
+            test_success = pb_utils.Tensor(
+                "TEST_SUCCESS", np.array([success], dtype=bool)
+            )
+            inference_response = pb_utils.InferenceResponse(
+                output_tensors=[test_success]
+            )
+            responses.append(inference_response)
+
+        return responses
diff --git a/docs/model_analyzer.md b/qa/L0_grpc/client_plugin_models/client_plugin_test/config.pbtxt
similarity index 63%
rename from docs/model_analyzer.md
rename to qa/L0_grpc/client_plugin_models/client_plugin_test/config.pbtxt
index 4f442e55cc..1bf368f795 100644
--- a/docs/model_analyzer.md
+++ b/qa/L0_grpc/client_plugin_models/client_plugin_test/config.pbtxt
@@ -1,4 +1,3 @@
-
 
-# Model Analyzer
+name: "client_plugin_test"
+backend: "python"
 
-The Triton Model Analyzer is a tool that uses [Performance
-Analyzer](perf_analyzer.md) to send requests to your model while
-measuring GPU memory and compute utilization. The Model Analyzer is
-specifically useful for characterizing the GPU memory requirements for
-your model under different batching and model instance
-configurations. Once you have this GPU memory usage information you
-can more intelligently decide on how to combine multiple models on the
-same GPU while remaining within the memory capacity of the GPU.
+input [
+  {
+    name: "EXPECTED_HEADERS"
+    data_type: TYPE_STRING
+    dims: [ 1 ]
+  }
+]
+output [
+  {
+    name: "TEST_SUCCESS"
+    data_type: TYPE_BOOL
+    dims: [ 1 ]
+  }
+]
 
-For more information see the [Model Analyzer
-repository](https://github.com/triton-inference-server/model_analyzer)
-and the detailed explanation provided in [Maximizing Deep Learning
-Inference Performance with NVIDIA Model
-Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer).
+instance_group [{ kind: KIND_CPU }]
diff --git a/qa/L0_grpc/grpc_basic_auth_test.py b/qa/L0_grpc/grpc_basic_auth_test.py
new file mode 100755
index 0000000000..07d29ef5b7
--- /dev/null
+++ b/qa/L0_grpc/grpc_basic_auth_test.py
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+import sys
+import unittest
+
+sys.path.append("../common")
+
+import test_util as tu
+import tritonclient.grpc as tritongrpcclient
+import tritonclient.grpc.aio as asynctritongrpcclient
+from tritonclient.grpc.aio.auth import BasicAuth as AsyncBasicAuth
+from tritonclient.grpc.auth import BasicAuth
+
+
+class GRPCBasicAuthTest(tu.TestResultCollector):
+    def setUp(self):
+        # Use the nginx port
+        self._client = tritongrpcclient.InferenceServerClient(url="localhost:8004")
+        self._client.register_plugin(BasicAuth("username", "password"))
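+        # The BasicAuth plugin attaches the credentials to every request so it
+        # can pass the nginx auth_basic check in front of the gRPC endpoint.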
+
+    def test_client_call(self):
+        self.assertTrue(self._client.is_server_live())
+
+    def tearDown(self):
+        self._client.close()
+
+
+class GRPCBasicAuthAsyncTest(unittest.IsolatedAsyncioTestCase):
+    async def asyncSetUp(self):
+        # Use the nginx port
+        self._client = asynctritongrpcclient.InferenceServerClient(url="localhost:8004")
+        self._client.register_plugin(AsyncBasicAuth("username", "password"))
+
+    async def test_client_call(self):
+        self.assertTrue(await self._client.is_server_live())
+
+    async def asyncTearDown(self):
+        await self._client.close()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_grpc/grpc_client_plugin_test.py b/qa/L0_grpc/grpc_client_plugin_test.py
new file mode 100755
index 0000000000..1cc8c474ef
--- /dev/null
+++ b/qa/L0_grpc/grpc_client_plugin_test.py
@@ -0,0 +1,120 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+import json
+import sys
+
+sys.path.append("../common")
+
+import unittest
+
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as tritongrpcclient
+import tritonclient.grpc.aio as asynctritongrpcclient
+from tritonclient.grpc import InferenceServerClientPlugin
+from tritonclient.utils import np_to_triton_dtype
+
+
+# A simple plugin that adds headers to the inference request.
+class TestPlugin(InferenceServerClientPlugin):
+    def __init__(self, headers):
+        self._headers = headers
+
+    def __call__(self, request):
+        request.headers.update(self._headers)
+
+
+def prepare_infer_inputs(headers):
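+    # Pack the expected headers as a JSON string into the TYPE_STRING
+    # EXPECTED_HEADERS input consumed by the client_plugin_test model.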
+    expected_headers = np.array([json.dumps(headers)], dtype=object)
+    inputs = []
+    inputs.append(
+        tritongrpcclient.InferInput(
+            "EXPECTED_HEADERS",
+            expected_headers.shape,
+            np_to_triton_dtype(expected_headers.dtype),
+        )
+    )
+    inputs[0].set_data_from_numpy(expected_headers)
+
+    return inputs
+
+
+class GRPCClientPluginAsyncTest(unittest.IsolatedAsyncioTestCase):
+    async def asyncSetUp(self):
+        self._headers = {"my-key": "my-value"}
+        self._plugin = TestPlugin(self._headers)
+        self._client = asynctritongrpcclient.InferenceServerClient(url="localhost:8001")
+
+    async def test_simple_infer(self):
+        model = "client_plugin_test"
+        inputs = prepare_infer_inputs(self._headers)
+        self._client.register_plugin(self._plugin)
+        response = await self._client.infer(model_name=model, inputs=inputs)
+        test_success = response.as_numpy("TEST_SUCCESS")
+        self.assertEqual(test_success, True)
+
+        self._client.unregister_plugin()
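+        # With the plugin unregistered no extra headers are injected; an empty
+        # expected-header set keeps the model check trivially successful.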
+        inputs = prepare_infer_inputs({})
+        response = await self._client.infer(model_name=model, inputs=inputs)
+        test_success = response.as_numpy("TEST_SUCCESS")
+        self.assertEqual(test_success, True)
+
+    async def asyncTearDown(self):
+        await self._client.close()
+
+
+class GRPCClientPluginTest(tu.TestResultCollector):
+    def setUp(self):
+        self._headers = {"my-key": "my-value"}
+        self._plugin = TestPlugin(self._headers)
+        self._client = tritongrpcclient.InferenceServerClient(url="localhost:8001")
+
+    def test_simple_infer(self):
+        # Set the binary data to False so that 'Inference-Header-Length' is not
+        # added to the headers.
+        model = "client_plugin_test"
+        inputs = prepare_infer_inputs(self._headers)
+        self._client.register_plugin(self._plugin)
+        self.assertEqual(self._plugin, self._client.plugin())
+        response = self._client.infer(model_name=model, inputs=inputs)
+        test_success = response.as_numpy("TEST_SUCCESS")
+        self.assertEqual(test_success, True)
+
+        # Unregister the plugin
+        inputs = prepare_infer_inputs({})
+        self._client.unregister_plugin()
+        self.assertEqual(None, self._client.plugin())
+        response = self._client.infer(model_name=model, inputs=inputs)
+        test_success = response.as_numpy("TEST_SUCCESS")
+        self.assertEqual(test_success, True)
+
+    def tearDown(self):
+        self._client.close()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_grpc/nginx.conf b/qa/L0_grpc/nginx.conf
new file mode 100644
index 0000000000..063d358c21
--- /dev/null
+++ b/qa/L0_grpc/nginx.conf
@@ -0,0 +1,54 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+worker_processes  1;
+
+error_log  /var/log/nginx/error.log;
+
+events {
+    worker_connections  1024;
+}
+
+http {
+    # Configure basic authentication
+    auth_basic "Restricted Content";
+    auth_basic_user_file /opt/tritonserver/qa/L0_grpc/pswd;
+
+    # Define upstream server
+    upstream backend {
+        server localhost:8001;
+    }
+
+    # Define server block for reverse proxy
+    server {
+        listen 8004 http2;
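+        # gRPC requires HTTP/2; authenticated traffic on 8004 is proxied to
+        # Triton's gRPC endpoint defined above.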
+
+        # Configure location for reverse proxy
+        location / {
+            grpc_pass grpc://backend;
+        }
+    }
+}
diff --git a/qa/L0_grpc/python_grpc_aio_test.py b/qa/L0_grpc/python_grpc_aio_test.py
new file mode 100755
index 0000000000..f342f19ad5
--- /dev/null
+++ b/qa/L0_grpc/python_grpc_aio_test.py
@@ -0,0 +1,125 @@
+#!/usr/bin/env python
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import unittest
+
+import tritonclient.grpc.aio as grpcclient
+from tritonclient.utils import *
+
+
+class TestGrpcAioClient(unittest.IsolatedAsyncioTestCase):
+    """Test if aio rpc can reach the server"""
+
+    def setUp(self):
+        self._triton_client = grpcclient.InferenceServerClient(url="localhost:8001")
+
+    async def asyncTearDown(self):
+        await self._triton_client.close()
+
+    async def test_is_server_live(self):
+        ret = await self._triton_client.is_server_live()
+        self.assertEqual(ret, True)
+
+    async def test_is_server_ready(self):
+        ret = await self._triton_client.is_server_ready()
+        self.assertEqual(ret, True)
+
+    async def test_is_model_ready(self):
+        ret = await self._triton_client.is_model_ready("simple")
+        self.assertEqual(ret, True)
+
+    async def test_get_server_metadata(self):
+        ret = await self._triton_client.get_server_metadata()
+        self.assertEqual(ret.name, "triton")
+
+        ret = await self._triton_client.get_server_metadata(as_json=True)
+        self.assertEqual(ret["name"], "triton")
+
+    async def test_get_model_metadata(self):
+        ret = await self._triton_client.get_model_metadata("simple")
+        self.assertEqual(ret.name, "simple")
+
+    async def test_get_model_config(self):
+        ret = await self._triton_client.get_model_config("simple")
+        self.assertEqual(ret.config.name, "simple")
+
+    async def test_get_model_repository_index(self):
+        ret = await self._triton_client.get_model_repository_index()
+        self.assertEqual(len(ret.models), 8)
+
+    async def test_load_model(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "\[StatusCode\.UNAVAILABLE\] explicit model load / unload is not allowed if polling is enabled",
+        ):
+            await self._triton_client.load_model("simple")
+
+    async def test_unload_model(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            r"\[StatusCode\.UNAVAILABLE\] explicit model load / unload is not allowed if polling is enabled",
+        ):
+            await self._triton_client.unload_model("simple")
+
+    async def test_get_inference_statistics(self):
+        await self._triton_client.get_inference_statistics()
+
+    async def test_update_trace_settings(self):
+        await self._triton_client.update_trace_settings()
+
+    async def test_get_trace_settings(self):
+        await self._triton_client.get_trace_settings()
+
+    async def test_get_system_shared_memory_status(self):
+        await self._triton_client.get_system_shared_memory_status()
+
+    async def test_register_system_shared_memory(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "\[StatusCode\.INTERNAL\] Unable to open shared memory region: ''",
+        ):
+            await self._triton_client.register_system_shared_memory("", "", 0)
+
+    async def test_unregister_system_shared_memory(self):
+        await self._triton_client.unregister_system_shared_memory()
+
+    async def test_get_cuda_shared_memory_status(self):
+        await self._triton_client.get_cuda_shared_memory_status()
+
+    async def test_register_cuda_shared_memory(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "\[StatusCode\.INVALID_ARGUMENT\] failed to register CUDA shared memory region '': failed to open CUDA IPC handle: invalid argument",
+        ):
+            await self._triton_client.register_cuda_shared_memory("", b"", 0, 0)
+
+    async def test_unregister_cuda_shared_memory(self):
+        await self._triton_client.unregister_cuda_shared_memory()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_grpc/python_unit_test.py b/qa/L0_grpc/python_unit_test.py
new file mode 100755
index 0000000000..9591d4274c
--- /dev/null
+++ b/qa/L0_grpc/python_unit_test.py
@@ -0,0 +1,159 @@
+#!/usr/bin/env python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import queue
+import time
+import unittest
+
+# For stream infer test
+from functools import partial
+
+import numpy as np
+import tritonclient.grpc as grpcclient
+from tritonclient.utils import InferenceServerException
+
+
+class UserData:
+    def __init__(self):
+        self._completed_requests = queue.Queue()
+
+
+def callback(user_data, result, error):
+    if error:
+        user_data._completed_requests.put(error)
+    else:
+        user_data._completed_requests.put(result)
+
+
+class RestrictedProtocolTest(unittest.TestCase):
+    def setUp(self):
+        self.client_ = grpcclient.InferenceServerClient(url="localhost:8001")
+        self.model_name_ = "simple"
+        self.prefix_ = "triton-grpc-protocol-"
+
+    # Other unspecified protocols should not be restricted
+    def test_sanity(self):
+        self.client_.get_inference_statistics("simple")
+        self.client_.get_inference_statistics(
+            "simple", headers={self.prefix_ + "infer-key": "infer-value"}
+        )
+
+    # The health, infer, and model repository protocols are restricted.
+    # health and infer expect the "triton-grpc-protocol-infer-key : infer-value" header,
+    # model repository expects "triton-grpc-protocol-admin-key : admin-value".
+    def test_model_repository(self):
+        with self.assertRaisesRegex(
+            InferenceServerException, "This protocol is restricted"
+        ):
+            self.client_.unload_model(
+                self.model_name_, headers={self.prefix_ + "infer-key": "infer-value"}
+            )
+        # Request goes through and hits the actual transaction error
+        with self.assertRaisesRegex(
+            InferenceServerException, "explicit model load / unload is not allowed"
+        ):
+            self.client_.unload_model(
+                self.model_name_, headers={self.prefix_ + "admin-key": "admin-value"}
+            )
+
+    def test_health(self):
+        with self.assertRaisesRegex(
+            InferenceServerException, "This protocol is restricted"
+        ):
+            self.client_.is_server_live()
+        self.client_.is_server_live({self.prefix_ + "infer-key": "infer-value"})
+
+    def test_infer(self):
+        # setup
+        inputs = [
+            grpcclient.InferInput("INPUT0", [1, 16], "INT32"),
+            grpcclient.InferInput("INPUT1", [1, 16], "INT32"),
+        ]
+        inputs[0].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+
+        # This test only cares whether the request goes through
+        with self.assertRaisesRegex(
+            InferenceServerException, "This protocol is restricted"
+        ):
+            _ = self.client_.infer(
+                model_name=self.model_name_, inputs=inputs, headers={"test": "1"}
+            )
+        self.client_.infer(
+            model_name=self.model_name_,
+            inputs=inputs,
+            headers={self.prefix_ + "infer-key": "infer-value"},
+        )
+
+    def test_stream_infer(self):
+        # setup
+        inputs = [
+            grpcclient.InferInput("INPUT0", [1, 16], "INT32"),
+            grpcclient.InferInput("INPUT1", [1, 16], "INT32"),
+        ]
+        inputs[0].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+        user_data = UserData()
+        # The server has no say in whether gRPC creates the stream; it is only
+        # notified after the stream is established, and only then can it access
+        # the metadata to decide whether to continue the stream.
+        # The client therefore always perceives the stream as successfully
+        # created and can only check its health at a later time.
+        self.client_.start_stream(partial(callback, user_data), headers={"test": "1"})
+        # wait for sufficient round-trip time
+        time.sleep(1)
+        with self.assertRaisesRegex(
+            InferenceServerException, "The stream is no longer in valid state"
+        ):
+            self.client_.async_stream_infer(model_name=self.model_name_, inputs=inputs)
+        # callback should record error detail
+        self.assertFalse(user_data._completed_requests.empty())
+        with self.assertRaisesRegex(
+            InferenceServerException, "This protocol is restricted"
+        ):
+            raise user_data._completed_requests.get()
+
+        self.assertTrue(user_data._completed_requests.empty())
+
+        # Stop and start new stream with proper header
+        self.client_.stop_stream()
+        self.client_.start_stream(
+            partial(callback, user_data),
+            headers={self.prefix_ + "infer-key": "infer-value"},
+        )
+        self.client_.async_stream_infer(model_name=self.model_name_, inputs=inputs)
+        # wait for response
+        time.sleep(1)
+        self.assertFalse(user_data._completed_requests.empty())
+        self.assertNotEqual(
+            type(user_data._completed_requests.get()), InferenceServerException
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_grpc/test.sh b/qa/L0_grpc/test.sh
old mode 100644
new mode 100755
index 70eb3bf561..73b9710a71
--- a/qa/L0_grpc/test.sh
+++ b/qa/L0_grpc/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -42,16 +42,22 @@ export CUDA_VISIBLE_DEVICES=0
 
 RET=0
 
+CLIENT_PLUGIN_TEST="./grpc_client_plugin_test.py"
+BASIC_AUTH_TEST="./grpc_basic_auth_test.py"
+NGINX_CONF="./nginx.conf"
 # On windows the paths invoked by the script (running in WSL) must use
 # /mnt/c when needed but the paths on the tritonserver command-line
 # must be C:/ style.
 if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then
     SDKDIR=${SDKDIR:=C:/sdk}
     MODELDIR=${MODELDIR:=C:/models}
+    CLIENT_PLUGIN_MODELDIR=${CLIENT_PLUGIN_MODELDIR:=C:/client_plugin_models}
     DATADIR=${DATADIR:="/mnt/c/data/inferenceserver/${REPO_VERSION}"}
     BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends}
     SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe}
 
+    SIMPLE_AIO_INFER_CLIENT_PY=${SDKDIR}/python/simple_grpc_aio_infer_client.py
+    SIMPLE_AIO_STREAM_INFER_CLIENT_PY=${SDKDIR}/python/simple_grpc_aio_sequence_stream_infer_client.py
     SIMPLE_HEALTH_CLIENT_PY=${SDKDIR}/python/simple_grpc_health_metadata.py
     SIMPLE_INFER_CLIENT_PY=${SDKDIR}/python/simple_grpc_infer_client.py
     SIMPLE_ASYNC_INFER_CLIENT_PY=${SDKDIR}/python/simple_grpc_async_infer_client.py
@@ -91,11 +97,14 @@ if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then
     CC_UNIT_TEST=${SDKDIR}/python/cc_client_test
 else
     MODELDIR=${MODELDIR:=`pwd`/models}
+    CLIENT_PLUGIN_MODELDIR=${CLIENT_PLUGIN_MODELDIR:=`pwd`/client_plugin_models}
     DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"}
     TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"}
     SERVER=${TRITON_DIR}/bin/tritonserver
     BACKEND_DIR=${TRITON_DIR}/backends
 
+    SIMPLE_AIO_INFER_CLIENT_PY=../clients/simple_grpc_aio_infer_client.py
+    SIMPLE_AIO_STREAM_INFER_CLIENT_PY=../clients/simple_grpc_aio_sequence_stream_infer_client.py
     SIMPLE_HEALTH_CLIENT_PY=../clients/simple_grpc_health_metadata.py
     SIMPLE_INFER_CLIENT_PY=../clients/simple_grpc_infer_client.py
     SIMPLE_ASYNC_INFER_CLIENT_PY=../clients/simple_grpc_async_infer_client.py
@@ -133,6 +142,7 @@ else
     SIMPLE_CUSTOM_ARGS_CLIENT=../clients/simple_grpc_custom_args_client
     CC_UNIT_TEST=../clients/cc_client_test
 fi
+PYTHON_UNIT_TEST=python_unit_test.py
 
 # Add string_dyna_sequence model to repo
 cp -r ${MODELDIR}/simple_dyna_sequence ${MODELDIR}/simple_string_dyna_sequence
@@ -168,6 +178,8 @@ fi
 
 IMAGE=../images/vulture.jpeg
 for i in \
+        $SIMPLE_AIO_INFER_CLIENT_PY \
+        $SIMPLE_AIO_STREAM_INFER_CLIENT_PY \
         $SIMPLE_INFER_CLIENT_PY \
         $SIMPLE_ASYNC_INFER_CLIENT_PY \
         $SIMPLE_STRING_INFER_CLIENT_PY \
@@ -327,6 +339,37 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${CLIENT_PLUGIN_MODELDIR} --http-header-forward-pattern=.* --grpc-header-forward-pattern=.*"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python3 $CLIENT_PLUGIN_TEST >> ${CLIENT_LOG}.python.plugin 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.python.plugin
+    RET=1
+fi
+set -e
+
+# Create a password file with username:password
+echo -n 'username:' > pswd
+echo "password" | openssl passwd -stdin -apr1 >> pswd
+nginx -c `pwd`/$NGINX_CONF
+
+set +e
+python3 $BASIC_AUTH_TEST >> ${CLIENT_LOG}.python.plugin.auth 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.python.plugin.auth
+    RET=1
+fi
+set -e
+service nginx stop
+
+kill $SERVER_PID
+wait $SERVER_PID
+
 export GRPC_TRACE=compression, channel
 export GRPC_VERBOSITY=DEBUG
 SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR} --grpc-infer-response-compression-level=high"
@@ -386,6 +429,10 @@ if [ $(cat ${CLIENT_LOG}.model_control | grep "PASS" | wc -l) -ne 1 ]; then
     cat ${CLIENT_LOG}.model_control
     RET=1
 fi
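+# The model control test above is expected to have triggered an
+# "Invalid config override" error in the server log.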
+if [ $(cat ${SERVER_LOG} | grep "Invalid config override" | wc -l) -eq 0 ]; then
+    cat ${SERVER_LOG}
+    RET=1
+fi
 set -e
 
 kill $SERVER_PID
@@ -443,7 +490,7 @@ wait $SERVER_PID
 # Run cpp client unit test
 rm -rf unit_test_models && mkdir unit_test_models
 cp -r $DATADIR/qa_model_repository/onnx_int32_int32_int32 unit_test_models/.
-cp -r ${MODELDIR}/simple unit_test_models/. 
+cp -r ${MODELDIR}/simple unit_test_models/.
 
 SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=unit_test_models
             --trace-file=global_unittest.log --trace-level=TIMESTAMPS --trace-rate=1"
@@ -481,21 +528,138 @@ SERVER_ARGS="--model-repository=`pwd`/unit_test_models \
              --strict-model-config=false"
 SERVER_LOG="./inference_server_cc_unit_test.load.log"
 CLIENT_LOG="./cc_unit_test.load.log"
+
+for i in \
+   "LoadWithFileOverride" \
+   "LoadWithConfigOverride" \
+   ; do
+    run_server
+    if [ "$SERVER_PID" == "0" ]; then
+        echo -e "\n***\n*** Failed to start $SERVER\n***"
+        cat $SERVER_LOG
+        exit 1
+    fi
+
+    set +e
+    $CC_UNIT_TEST --gtest_filter=GRPC*$i >> ${CLIENT_LOG}.$i 2>&1
+    if [ $? -ne 0 ]; then
+        cat ${CLIENT_LOG}.$i
+        RET=1
+    fi
+    set -e
+
+    kill $SERVER_PID
+    wait $SERVER_PID
+done
+
+# Run python grpc aio unit test
+PYTHON_GRPC_AIO_TEST=python_grpc_aio_test.py
+CLIENT_LOG=`pwd`/python_grpc_aio_test.log
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR}"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
     cat $SERVER_LOG
     exit 1
 fi
-
 set +e
-$CC_UNIT_TEST --gtest_filter=GRPC*Load* >> ${CLIENT_LOG} 2>&1
+python $PYTHON_GRPC_AIO_TEST > $CLIENT_LOG 2>&1
 if [ $? -ne 0 ]; then
-    cat ${CLIENT_LOG}
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Python GRPC AsyncIO Test Failed\n***"
     RET=1
 fi
 set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test that the GRPC health check is implemented
+go install github.com/grpc-ecosystem/grpc-health-probe@latest
+HEALTH_PROBE="${GOPATH}/bin/grpc-health-probe -addr=localhost:8001"
 
+CLIENT_LOG=`pwd`/grpc_health_probe_offline.log
+set +e
+$HEALTH_PROBE > $CLIENT_LOG 2>&1
+set -e
+if [ `grep -c "timeout: failed to connect service" ${CLIENT_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected health check timeout\n***"
+    cat $CLIENT_LOG
+    RET=1
+fi
+
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR}"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+CLIENT_LOG=`pwd`/grpc_health_probe_online.log
+set +e
+$HEALTH_PROBE > $CLIENT_LOG 2>&1
+set -e
+if [ `grep -c "status: SERVING" ${CLIENT_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected health check to return SERVING\n***"
+    cat $CLIENT_LOG
+    RET=1
+fi
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Repeated protocol, not allowed
+SERVER_ARGS="--model-repository=${MODELDIR} \
+             --grpc-restricted-protocol=model-repository,health:k1=v1 \
+             --grpc-restricted-protocol=metadata,health:k2=v2"
+run_server
+EXPECTED_MSG="protocol 'health' can not be specified in multiple config groups"
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Expect fail to start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    RET=1
+elif [ `grep -c "${EXPECTED_MSG}" ${SERVER_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected ${EXPECTED_MSG} to be found in log\n***"
+    cat $SERVER_LOG
+    RET=1
+fi
+
+# Unknown protocol, not allowed
+SERVER_ARGS="--model-repository=${MODELDIR} \
+             --grpc-restricted-protocol=model-reposit,health:k1=v1 \
+             --grpc-restricted-protocol=metadata,health:k2=v2"
+run_server
+EXPECTED_MSG="unknown restricted protocol 'model-reposit'"
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Expect fail to start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    RET=1
+elif [ `grep -c "${EXPECTED_MSG}" ${SERVER_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected ${EXPECTED_MSG} to be found in log\n***"
+    cat $SERVER_LOG
+    RET=1
+fi
+
+# Test restricted protocols
+SERVER_ARGS="--model-repository=${MODELDIR} \
+             --grpc-restricted-protocol=model-repository:admin-key=admin-value \
+             --grpc-restricted-protocol=inference,health:infer-key=infer-value"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+set +e
+python $PYTHON_UNIT_TEST RestrictedProtocolTest > $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Python GRPC Restricted Protocol Test Failed\n***"
+    RET=1
+fi
+set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
@@ -506,3 +670,4 @@ else
 fi
 
 exit $RET
+
diff --git a/qa/L0_grpc_state_cleanup/cleanup_test.py b/qa/L0_grpc_state_cleanup/cleanup_test.py
new file mode 100755
index 0000000000..89af756a8b
--- /dev/null
+++ b/qa/L0_grpc_state_cleanup/cleanup_test.py
@@ -0,0 +1,560 @@
+#!/usr/bin/env python3
+
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import os
+import queue
+import signal
+import time
+import unittest
+from functools import partial
+
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+from tritonclient.utils import InferenceServerException
+
+
+class UserData:
+    def __init__(self):
+        self._response_queue = queue.Queue()
+
+
+def callback(user_data, result, error):
+    if error:
+        user_data._response_queue.put(error)
+    else:
+        user_data._response_queue.put(result)
+
+
+# These state cleanup tests rely on test.sh
+# to check whether all the created request objects
+# were properly deleted by the server.
+# The purpose of these unit tests is to exercise
+# different portions of the gRPC frontend and
+# track the state objects.
+class CleanUpTest(tu.TestResultCollector):
+    SERVER_PID = None
+
+    def setUp(self):
+        self.decoupled_model_name_ = "repeat_int32"
+        self.identity_model_name_ = "custom_zero_1_float32"
+
+    def _prepare_inputs_and_outputs(self, kind):
+        if kind == "decoupled_streaming":
+            self.inputs_ = []
+            self.inputs_.append(grpcclient.InferInput("IN", [1], "INT32"))
+            self.inputs_.append(grpcclient.InferInput("DELAY", [1], "UINT32"))
+            self.inputs_.append(grpcclient.InferInput("WAIT", [1], "UINT32"))
+
+            self.outputs_ = []
+            self.outputs_.append(grpcclient.InferRequestedOutput("OUT"))
+            self.outputs_.append(grpcclient.InferRequestedOutput("IDX"))
+            self.requested_outputs_ = self.outputs_
+        elif kind == "simple" or kind == "streaming":
+            self.inputs_ = []
+            self.inputs_.append(grpcclient.InferInput("INPUT0", [1, 1], "FP32"))
+
+            self.outputs_ = []
+            self.outputs_.append(grpcclient.InferRequestedOutput("OUTPUT0"))
+            self.requested_outputs_ = self.outputs_
+        else:
+            raise ValueError("Unsupported kind specified to prepare inputs/outputs")
+
+    def _simple_infer(
+        self,
+        request_count,
+        cancel_response_idx=None,
+        client_timeout_pair=None,
+        kill_server=None,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
+            self._prepare_inputs_and_outputs("simple")
+
+            input_data = np.array([[1.0]], dtype=np.float32)
+            self.inputs_[0].set_data_from_numpy(input_data)
+
+            user_data = UserData()
+
+            futures = []
+            timeout_idx = None
+            timeout_value = None
+            if client_timeout_pair:
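+                # client_timeout_pair = (request index, timeout in seconds);
+                # only that one request is sent with a client-side timeout.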
+                timeout_idx, timeout_value = client_timeout_pair
+            for i in range(request_count):
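+                # Optionally send SIGINT to the server right before issuing the
+                # kill_server-th request to exercise shutdown cleanup.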
+                if kill_server == i:
+                    os.kill(int(self.SERVER_PID), signal.SIGINT)
+                this_timeout = None
+                if timeout_idx == i:
+                    this_timeout = timeout_value
+                futures.append(
+                    triton_client.async_infer(
+                        model_name=self.identity_model_name_,
+                        inputs=self.inputs_,
+                        request_id=str(i),
+                        callback=partial(callback, user_data),
+                        outputs=self.requested_outputs_,
+                        client_timeout=this_timeout,
+                    )
+                )
+
+            if cancel_response_idx is not None:
+                futures[cancel_response_idx].cancel()
+
+            responses = []
+            while len(responses) < len(futures):
+                data_item = user_data._response_queue.get()
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    responses.append(data_item)
+
+            for response in responses:
+                output0_data = response.as_numpy("OUTPUT0")
+                self.assertTrue(np.array_equal(input_data, output0_data))
+
+    def _stream_infer_with_params(
+        self,
+        request_count,
+        request_delay,
+        _,
+        user_data,
+        result_dict,
+        delay_data=None,
+        delay_factor=None,
+        cancel_response_idx=None,
+        stream_timeout=None,
+        kill_server=None,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
+            # Establish stream
+            triton_client.start_stream(
+                callback=partial(callback, user_data), stream_timeout=stream_timeout
+            )
+            # Send the specified number of requests in parallel
+            for i in range(request_count):
+                time.sleep((request_delay / 1000))
+                self.inputs_[1].set_data_from_numpy(delay_data)
+                if kill_server == i:
+                    os.kill(int(self.SERVER_PID), signal.SIGINT)
+                triton_client.async_stream_infer(
+                    model_name=self.decoupled_model_name_,
+                    inputs=self.inputs_,
+                    request_id=str(i),
+                    outputs=self.requested_outputs_,
+                    # Opt-in to receiving flags-only responses from model/backend
+                    # to help detect final responses for decoupled models.
+                    enable_empty_final_response=True,
+                )
+                # Update delay input in accordance with the scaling factor
+                delay_data = delay_data * delay_factor
+                delay_data = delay_data.astype(np.uint32)
+
+            # Retrieve results...
+            recv_count = 0
+            completed_requests = 0
+            while completed_requests < request_count:
+                if cancel_response_idx == recv_count:
+                    triton_client.stop_stream(cancel_requests=True)
+                data_item = user_data._response_queue.get()
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    response = data_item.get_response()
+                    # Request IDs should generally be provided with each request
+                    # to associate decoupled responses with their requests.
+                    if not response.id:
+                        raise ValueError(
+                            "No response id found. Was a request_id provided?"
+                        )
+
+                    # Detect final response. Parameters are oneof and we expect bool_param
+                    if response.parameters.get("triton_final_response").bool_param:
+                        completed_requests += 1
+
+                    # Only process non-empty response, ignore if empty (no outputs)
+                    if response.outputs:
+                        if response.id not in result_dict:
+                            result_dict[response.id] = []
+                        result_dict[response.id].append((recv_count, data_item))
+                        recv_count += 1
+
+    def _stream_infer(
+        self,
+        request_count,
+        request_delay,
+        expected_count,
+        user_data,
+        result_dict,
+        delay_data=None,
+        delay_factor=None,
+        cancel_response_idx=None,
+        stream_timeout=None,
+        kill_server=None,
+    ):
+        with grpcclient.InferenceServerClient(
+            url="localhost:8001", verbose=True
+        ) as triton_client:
+            # Establish stream
+            triton_client.start_stream(
+                callback=partial(callback, user_data), stream_timeout=stream_timeout
+            )
+            # Send the specified number of requests in parallel
+            for i in range(request_count):
+                time.sleep((request_delay / 1000))
+                model_name = self.identity_model_name_
+                if delay_data is not None:
+                    model_name = self.decoupled_model_name_
+                    self.inputs_[1].set_data_from_numpy(delay_data)
+                if kill_server == i:
+                    os.kill(int(self.SERVER_PID), signal.SIGINT)
+                triton_client.async_stream_infer(
+                    model_name=model_name,
+                    inputs=self.inputs_,
+                    request_id=str(i),
+                    outputs=self.requested_outputs_,
+                )
+                if (delay_data is not None) and (delay_factor is not None):
+                    # Update delay input in accordance with the scaling factor
+                    delay_data = delay_data * delay_factor
+                    delay_data = delay_data.astype(np.uint32)
+
+            # Retrieve results...
+            recv_count = 0
+            while recv_count < expected_count:
+                if cancel_response_idx == recv_count:
+                    triton_client.stop_stream(cancel_requests=True)
+                data_item = user_data._response_queue.get()
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    this_id = data_item.get_response().id
+                    if this_id not in result_dict:
+                        result_dict[this_id] = []
+                    result_dict[this_id].append((recv_count, data_item))
+
+                recv_count += 1
+
+    def _streaming_infer(
+        self,
+        request_count,
+        request_delay=0,
+        cancel_response_idx=None,
+        stream_timeout=None,
+        kill_server=None,
+        should_error=True,
+    ):
+        self._prepare_inputs_and_outputs("streaming")
+
+        input_data = np.array([[1.0]], dtype=np.float32)
+        self.inputs_[0].set_data_from_numpy(input_data)
+
+        user_data = UserData()
+        result_dict = {}
+
+        try:
+            expected_count = request_count
+            self._stream_infer(
+                request_count,
+                request_delay,
+                expected_count,
+                user_data,
+                result_dict,
+                cancel_response_idx=cancel_response_idx,
+                stream_timeout=stream_timeout,
+                kill_server=kill_server,
+            )
+        except Exception as ex:
+            if cancel_response_idx or stream_timeout or should_error:
+                raise ex
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+        # Validate the results..
+        for i in range(request_count):
+            this_id = str(i)
+            if this_id not in result_dict.keys():
+                self.assertTrue(
+                    False, "response for request id {} not received".format(this_id)
+                )
+            self.assertEqual(len(result_dict[this_id]), 1)
+            result = result_dict[this_id][0][1]
+            output0_data = result.as_numpy("OUTPUT0")
+            self.assertTrue(np.array_equal(input_data, output0_data))
+
+    def _decoupled_infer(
+        self,
+        request_count,
+        request_delay=0,
+        repeat_count=1,
+        data_offset=100,
+        delay_time=1000,
+        delay_factor=1,
+        wait_time=500,
+        cancel_response_idx=None,
+        stream_timeout=None,
+        kill_server=None,
+        should_error=True,
+        infer_helper_map=[True, True],
+    ):
+        self._prepare_inputs_and_outputs(kind="decoupled_streaming")
+
+        # Initialize data for IN
+        input_data = np.arange(
+            start=data_offset, stop=data_offset + repeat_count, dtype=np.int32
+        )
+        self.inputs_[0].set_shape([repeat_count])
+        self.inputs_[0].set_data_from_numpy(input_data)
+
+        # Initialize data for DELAY
+        delay_data = (np.ones([repeat_count], dtype=np.uint32)) * delay_time
+        self.inputs_[1].set_shape([repeat_count])
+
+        # Initialize data for WAIT
+        wait_data = np.array([wait_time], dtype=np.uint32)
+        self.inputs_[2].set_data_from_numpy(wait_data)
+
+        infer_helpers = []
+        if infer_helper_map[0]:
+            infer_helpers.append(self._stream_infer)
+        if infer_helper_map[1]:
+            infer_helpers.append(self._stream_infer_with_params)
+
+        for infer_helper in infer_helpers:
+            user_data = UserData()
+            result_dict = {}
+
+            try:
+                expected_count = repeat_count * request_count
+                infer_helper(
+                    request_count,
+                    request_delay,
+                    expected_count,
+                    user_data,
+                    result_dict,
+                    delay_data,
+                    delay_factor,
+                    cancel_response_idx,
+                    stream_timeout,
+                    kill_server,
+                )
+            except Exception as ex:
+                if cancel_response_idx or stream_timeout or should_error:
+                    raise ex
+                self.assertTrue(False, "unexpected error {}".format(ex))
+
+            # Validate the results..
+            for i in range(request_count):
+                this_id = str(i)
+                if repeat_count != 0 and this_id not in result_dict.keys():
+                    self.assertTrue(
+                        False, "response for request id {} not received".format(this_id)
+                    )
+                elif repeat_count == 0 and this_id in result_dict.keys():
+                    self.assertTrue(
+                        False,
+                        "received unexpected response for request id {}".format(
+                            this_id
+                        ),
+                    )
+                if repeat_count != 0:
+                    self.assertEqual(len(result_dict[this_id]), repeat_count)
+                    expected_data = data_offset
+                    result_list = result_dict[this_id]
+                    for j in range(len(result_list)):
+                        this_data = result_list[j][1].as_numpy("OUT")
+                        self.assertEqual(len(this_data), 1)
+                        self.assertEqual(this_data[0], expected_data)
+                        this_idx = result_list[j][1].as_numpy("IDX")
+                        self.assertEqual(len(this_idx), 1)
+                        self.assertEqual(this_idx[0], j)
+                        expected_data += 1
+
+    ###
+    ### Non-Streaming Tests
+    ###
+    def test_simple_infer(self):
+        # This test case sends 10 asynchronous requests and validates
+        # the response.
+        self._simple_infer(request_count=10)
+
+    def test_simple_infer_cancellation(self):
+        # This test case is used to check whether all the states are
+        # correctly released when one of the requests is cancelled from
+        # the client side.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._simple_infer(request_count=10, cancel_response_idx=5)
+        self.assertIn("Locally cancelled by application!", str(cm.exception))
+
+    def test_simple_infer_timeout(self):
+        # This test case is used to check whether all the states are
+        # correctly released when a request times out on the client.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._simple_infer(request_count=10, client_timeout_pair=[5, 0.1])
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+
+    def test_simple_infer_error_status(self):
+        # This test case is used to check whether all the state objects are
+        # released when RPC runs into error.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._simple_infer(request_count=10)
+        self.assertIn(
+            "This protocol is restricted, expecting header 'triton-grpc-protocol-infer-key'",
+            str(cm.exception),
+        )
+
+    def test_simple_infer_shutdownserver(self):
+        # This test case is used to check whether all the state objects are
+        # released when the server is interrupted and shut down in the middle
+        # of an inference run, with final parameters being returned.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._simple_infer(request_count=10, kill_server=5)
+
+    ###
+    ### Streaming Tests
+    ###
+    def test_streaming_infer(self):
+        # Sanity test to check whether all the state objects
+        # are correctly released. Sends 10 requests in a single
+        # gRPC bidirectional stream.
+        self._streaming_infer(request_count=10)
+
+    def test_streaming_cancellation(self):
+        # This test case is used to check whether all the states are
+        # correctly released when the stream is closed after the fifth
+        # response is received.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._streaming_infer(request_count=10, cancel_response_idx=5)
+        self.assertIn("Locally cancelled by application!", str(cm.exception))
+
+    def test_streaming_timeout(self):
+        # This test case checks whether all the states are released
+        # when some of the requests time out.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._streaming_infer(request_count=10, request_delay=1, stream_timeout=2)
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+
+    def test_streaming_error_status(self):
+        # This test case checks whether all the state objects are
+        # released when the RPC runs into an error.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._streaming_infer(request_count=10, should_error=True)
+        self.assertIn(
+            "This protocol is restricted, expecting header 'triton-grpc-protocol-infer-key'",
+            str(cm.exception),
+        )
+
+    def test_streaming_infer_shutdownserver(self):
+        # This test case checks whether all the state objects are
+        # released when the server is shut down in the middle of an
+        # inference run.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._streaming_infer(
+                request_count=10,
+                request_delay=1,
+                kill_server=5,
+                should_error=True,
+            )
+
+    ###
+    ### Decoupled Streaming Tests
+    ###
+    def test_decoupled_infer(self):
+        # Sanity test to check whether all the state objects
+        # are correctly released. Sends 10 requests in a single
+        # gRPC bidirectional stream and expects each of these
+        # requests to generate 10 responses.
+        self._decoupled_infer(request_count=10, repeat_count=10)
+
+    def test_decoupled_cancellation(self):
+        # This test case checks whether all the states are correctly
+        # released when the stream is closed after the fifth response
+        # is received.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(
+                request_count=10, repeat_count=10, cancel_response_idx=5
+            )
+        self.assertIn("Locally cancelled by application!", str(cm.exception))
+
+    def test_decoupled_timeout(self):
+        # This test case checks whether all the states are released
+        # when some of the requests time out.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(
+                request_count=10, repeat_count=10, request_delay=1, stream_timeout=2
+            )
+        self.assertIn("Deadline Exceeded", str(cm.exception))
+
+    def test_decoupled_error_status(self):
+        # This test case checks whether all the state objects are
+        # released when the RPC runs into an error.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(request_count=10, repeat_count=10, should_error=True)
+        self.assertIn(
+            "This protocol is restricted, expecting header 'triton-grpc-protocol-infer-key'",
+            str(cm.exception),
+        )
+
+    def test_decoupled_infer_shutdownserver(self):
+        # This test case checks whether all the state objects are
+        # released when the server is shut down in the middle of an
+        # inference run.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(
+                request_count=10,
+                repeat_count=10,
+                request_delay=1,
+                kill_server=5,
+                should_error=True,
+                infer_helper_map=[True, False],
+            )
+
+    def test_decoupled_infer_with_params_shutdownserver(self):
+        # This test case checks whether all the state objects are
+        # released when the server is shut down in the middle of an
+        # inference run while the final parameters are being returned.
+        with self.assertRaises(InferenceServerException) as cm:
+            self._decoupled_infer(
+                request_count=10,
+                repeat_count=10,
+                request_delay=1,
+                kill_server=5,
+                should_error=True,
+                infer_helper_map=[False, True],
+            )
+
+
+if __name__ == "__main__":
+    CleanUpTest.SERVER_PID = os.environ.get("SERVER_PID", CleanUpTest.SERVER_PID)
+    unittest.main()
diff --git a/qa/L0_grpc_state_cleanup/test.sh b/qa/L0_grpc_state_cleanup/test.sh
new file mode 100755
index 0000000000..605edb6f9c
--- /dev/null
+++ b/qa/L0_grpc_state_cleanup/test.sh
@@ -0,0 +1,194 @@
+#!/bin/bash
+# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+export CUDA_VISIBLE_DEVICES=0
+
+RET=0
+CLEANUP_TEST=cleanup_test.py
+
+rm -f *.log
+
+CLIENT_LOG=`pwd`/client.log
+SERVER=/opt/tritonserver/bin/tritonserver
+source ../common/util.sh
+
+function check_state_release() {
+  local log_file=$1
+
+  num_state_release=`cat $log_file | grep  "StateRelease" | wc -l`
+  num_state_new=`cat $log_file | grep  "StateNew" | wc -l`
+
+  if [ $num_state_release -ne $num_state_new ]; then
+    cat $log_file
+    echo -e "\n***\n*** Test Failed: Mismatch detected, $num_state_new state(s) created, $num_state_release state(s) released. \n***" >> $log_file
+    return 1
+  fi
+
+  return 0
+}
+
+rm -fr ./models/custom_zero_1_float32 && \
+        cp -r ../custom_models/custom_zero_1_float32 ./models/. && \
+        mkdir -p ./models/custom_zero_1_float32/1
+
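+# The execute_delay_ms parameter is expected to make the custom model delay
+# roughly 1 second before responding, keeping requests in flight long enough
+# for the cancellation and timeout tests to exercise the state cleanup paths.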
+(cd models/custom_zero_1_float32 && \
+    echo "parameters [" >> config.pbtxt && \
+    echo "{ key: \"execute_delay_ms\"; value: { string_value: \"1000\" }}" >> config.pbtxt && \
+    echo "]" >> config.pbtxt)
+
+for i in test_simple_infer \
+            test_simple_infer_cancellation \
+            test_simple_infer_timeout \
+            test_streaming_infer \
+            test_streaming_timeout \
+            test_streaming_cancellation \
+            test_decoupled_infer \
+            test_decoupled_cancellation \
+            test_decoupled_timeout; do
+  SERVER_LOG="./inference_server.$i.log"
+  SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=2"
+  run_server
+  if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+  fi
+
+  echo "Test: $i" >>$CLIENT_LOG
+
+  set +e
+  python $CLEANUP_TEST CleanUpTest.$i >>$CLIENT_LOG 2>&1
+  if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
+    echo -e "\n***\n*** Test $i Failed\n***"
+    RET=1
+  fi
+
+  kill $SERVER_PID
+  wait $SERVER_PID
+
+  check_state_release $SERVER_LOG
+  if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** State Verification Failed for $i\n***"
+    RET=1
+  fi
+  set -e
+done
+
+
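+# These tests start the server with the gRPC inference protocol restricted so
+# that requests fail with a restricted-protocol error; state should still be
+# released even when the RPC errors out.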
+for i in test_simple_infer_error_status \
+                test_streaming_error_status \
+                test_decoupled_error_status; do
+  SERVER_LOG="./inference_server.$i.log"
+  SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=2 --grpc-restricted-protocol=inference:infer-key=infer-value"
+  run_server
+  if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+  fi
+
+  echo "Test: $i" >>$CLIENT_LOG
+
+  set +e
+  python $CLEANUP_TEST CleanUpTest.$i >>$CLIENT_LOG 2>&1
+  if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
+    echo -e "\n***\n*** Test $i Failed\n***"
+    RET=1
+  fi
+
+  kill $SERVER_PID
+  wait $SERVER_PID
+
+  check_state_release $SERVER_LOG
+  if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** State Verification Failed for $i\n***"
+    RET=1
+  fi
+
+  set -e
+done
+
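+# The shutdown tests kill the server from inside the test case (via the
+# SERVER_PID environment variable), so only wait for the process here.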
+for i in test_simple_infer_shutdownserver \
+         test_streaming_infer_shutdownserver \
+         test_decoupled_infer_shutdownserver \
+         test_decoupled_infer_with_params_shutdownserver; do
+  SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=2"
+  SERVER_LOG="./inference_server.$i.log"
+  run_server
+  if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+  fi
+
+  echo "Test: $i" >>$CLIENT_LOG
+
+  set +e
+  SERVER_PID=$SERVER_PID python $CLEANUP_TEST CleanUpTest.$i >>$CLIENT_LOG 2>&1
+  if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Test $i Failed\n***" >>$CLIENT_LOG
+    echo -e "\n***\n*** Test $i Failed\n***"
+    RET=1
+  fi
+
+  wait $SERVER_PID
+
+  check_state_release $SERVER_LOG
+  if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** State Verification Failed for $i\n***"
+    RET=1
+  fi
+
+  set -e
+done
+
+
+if [ $RET -eq 0 ]; then
+  echo -e "\n***\n*** Test Passed\n***"
+else
+  echo -e "\n***\n*** Test Failed\n***"
+fi
+
+exit $RET
diff --git a/qa/L0_http/generate_endpoint_test.py b/qa/L0_http/generate_endpoint_test.py
new file mode 100755
index 0000000000..29d2e20d96
--- /dev/null
+++ b/qa/L0_http/generate_endpoint_test.py
@@ -0,0 +1,419 @@
+#!/usr/bin/python3
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import json
+import threading
+import time
+import unittest
+
+import requests
+import sseclient
+import test_util as tu
+
+
+class GenerateEndpointTest(tu.TestResultCollector):
+    def setUp(self):
+        self._model_name = "mock_llm"
+
+    def _get_infer_url(self, model_name, route):
+        return f"http://localhost:8000/v2/models/{model_name}/{route}"
+
+    def generate_stream(self, model_name, inputs, stream=False):
+        headers = {"Accept": "text/event-stream"}
+        url = self._get_infer_url(model_name, "generate_stream")
+        # stream=True indicates the response can be iterated over as it
+        # arrives, which should be the common setting for generate_stream.
+        # The correctness test cases use stream=False so that the full
+        # response content can be re-examined after the request completes.
+        return requests.post(
+            url,
+            data=inputs if isinstance(inputs, str) else json.dumps(inputs),
+            headers=headers,
+            stream=stream,
+        )
+
+    def generate(self, model_name, inputs):
+        url = self._get_infer_url(model_name, "generate")
+        return requests.post(
+            url, data=inputs if isinstance(inputs, str) else json.dumps(inputs)
+        )
+
+    def generate_expect_failure(self, model_name, inputs, msg):
+        url = self._get_infer_url(model_name, "generate")
+        r = requests.post(
+            url, data=inputs if isinstance(inputs, str) else json.dumps(inputs)
+        )
+        # Content-Type header should always be JSON for errors
+        self.assertEqual(r.headers["Content-Type"], "application/json")
+
+        try:
+            r.raise_for_status()
+            self.fail(f"Expected failure, success for {inputs}")
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(msg, r.json()["error"])
+
+    def generate_stream_expect_failure(self, model_name, inputs, msg):
+        r = self.generate_stream(model_name, inputs)
+        # Content-Type header should always be JSON for errors
+        self.assertEqual(r.headers["Content-Type"], "application/json")
+
+        try:
+            r.raise_for_status()
+            self.fail(f"Expected failure, success for {inputs}")
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(msg, r.json()["error"])
+
+    def generate_stream_expect_success(
+        self, model_name, inputs, expected_output, rep_count
+    ):
+        r = self.generate_stream(model_name, inputs)
+        r.raise_for_status()
+        self.check_sse_responses(r, [{"TEXT": expected_output}] * rep_count)
+
+    def check_sse_responses(self, res, expected_res):
+        # Validate SSE format
+        self.assertIn("Content-Type", res.headers)
+        self.assertEqual(
+            "text/event-stream; charset=utf-8", res.headers["Content-Type"]
+        )
+
+        # The SSE wire format ("data: ...") is tedious to parse by hand, so use
+        # the sseclient helper library for simplicity.
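+        # Each event on the wire is assumed to look roughly like
+        #   data: {"TEXT": "hello world"}
+        # followed by a blank line that terminates the event.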
+        client = sseclient.SSEClient(res)
+        res_count = 0
+        for event in client.events():
+            # Parse event data, join events into a single response
+            data = json.loads(event.data)
+            for key, value in expected_res[res_count].items():
+                self.assertIn(key, data)
+                self.assertEqual(value, data[key])
+            res_count += 1
+        self.assertEqual(len(expected_res), res_count)
+        # Make sure there is no message in the wrong form
+        for remaining in client._read():
+            self.assertTrue(
+                remaining.startswith(b"data:"),
+                f"SSE response not formed properly, got: {remaining}",
+            )
+            self.assertTrue(
+                remaining.endswith(b"\n\n"),
+                f"SSE response not formed properly, got: {remaining}",
+            )
+
+    def test_generate(self):
+        # Setup text-based input
+        text = "hello world"
+        inputs = {"PROMPT": text, "STREAM": False}
+
+        r = self.generate(self._model_name, inputs)
+        r.raise_for_status()
+
+        self.assertIn("Content-Type", r.headers)
+        self.assertEqual(r.headers["Content-Type"], "application/json")
+
+        data = r.json()
+        self.assertIn("TEXT", data)
+        self.assertEqual(text, data["TEXT"])
+
+    def test_generate_stream(self):
+        # Setup text-based input
+        text = "hello world"
+        rep_count = 3
+        inputs = {"PROMPT": [text], "STREAM": True, "REPETITION": rep_count}
+        self.generate_stream_expect_success(self._model_name, inputs, text, rep_count)
+
+    def test_streaming(self):
+        # Verify that responses are streamed back as soon as they are generated.
+        text = "hello world"
+        rep_count = 3
+        inputs = {"PROMPT": [text], "STREAM": True, "REPETITION": rep_count, "DELAY": 2}
+        past = time.time()
+        res = self.generate_stream(self._model_name, inputs, stream=True)
+        client = sseclient.SSEClient(res)
+        # This test does not focus on event content
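+        # With DELAY=2 each event should arrive roughly two seconds apart; the
+        # 1-3 second window below leaves room for scheduling jitter.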
+        for _ in client.events():
+            now = time.time()
+            self.assertTrue(1 < (now - past) < 3)
+            past = now
+
+    def test_missing_inputs(self):
+        missing_all_inputs = [
+            # Missing all inputs
+            {},
+            {"abc": 123},
+        ]
+        missing_one_input = [
+            # Missing 1 input
+            {"PROMPT": "hello"},
+            {"STREAM": False},
+            {"STREAM": False, "other": "param"},
+        ]
+        for inputs in missing_all_inputs:
+            self.generate_expect_failure(
+                self._model_name, inputs, "expected 2 inputs but got 0"
+            )
+            self.generate_stream_expect_failure(
+                self._model_name, inputs, "expected 2 inputs but got 0"
+            )
+
+        for inputs in missing_one_input:
+            self.generate_expect_failure(
+                self._model_name, inputs, "expected 2 inputs but got 1"
+            )
+            self.generate_stream_expect_failure(
+                self._model_name, inputs, "expected 2 inputs but got 1"
+            )
+
+    def test_invalid_input_types(self):
+        invalid_bool = "attempt to access JSON non-boolean as boolean"
+        invalid_string = "attempt to access JSON non-string as string"
+        invalid_type_inputs = [
+            # Prompt bad type
+            ({"PROMPT": 123, "STREAM": False}, invalid_string),
+            # Stream bad type
+            ({"PROMPT": "hello", "STREAM": "false"}, invalid_bool),
+            # Both bad type, parsed in order
+            ({"PROMPT": True, "STREAM": 123}, invalid_string),
+            ({"STREAM": 123, "PROMPT": True}, invalid_bool),
+        ]
+
+        for inputs, error_msg in invalid_type_inputs:
+            self.generate_expect_failure(self._model_name, inputs, error_msg)
+            self.generate_stream_expect_failure(self._model_name, inputs, error_msg)
+
+    def test_duplicate_inputs(self):
+        dupe_prompt = "input 'PROMPT' already exists in request"
+        dupe_stream = "input 'STREAM' already exists in request"
+        # Use a JSON string directly since a Python dict can't hold duplicate keys
+        invalid_type_inputs = [
+            # One duplicate
+            (
+                '{"PROMPT": "hello", "STREAM": false, "PROMPT": "duplicate"}',
+                dupe_prompt,
+            ),
+            ('{"PROMPT": "hello", "STREAM": false, "STREAM": false}', dupe_stream),
+            # Multiple duplicates, parsed in order
+            (
+                '{"PROMPT": "hello", "STREAM": false, "PROMPT": "duplicate", "STREAM": true}',
+                dupe_prompt,
+            ),
+            (
+                '{"PROMPT": "hello", "STREAM": false, "STREAM": true, "PROMPT": "duplicate"}',
+                dupe_stream,
+            ),
+        ]
+        for inputs, error_msg in invalid_type_inputs:
+            self.generate_expect_failure(self._model_name, inputs, error_msg)
+            self.generate_stream_expect_failure(self._model_name, inputs, error_msg)
+
+    def test_generate_stream_response_error(self):
+        # Setup text-based input
+        text = "hello world"
+        inputs = {"PROMPT": [text], "STREAM": True, "REPETITION": 0, "FAIL_LAST": True}
+        r = self.generate_stream(self._model_name, inputs)
+
+        # With "REPETITION": 0, the error is the first response, so the HTTP
+        # status code is set accordingly.
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.check_sse_responses(r, [{"error": "An Error Occurred"}])
+
+        # With "REPETITION" > 0, the first response is a valid response and sets
+        # the HTTP status to success, so the user must validate each response.
+        inputs["REPETITION"] = 1
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+
+        self.check_sse_responses(r, [{"TEXT": text}, {"error": "An Error Occurred"}])
+
+    def test_race_condition(self):
+        # In the Triton HTTP frontend, the HTTP response is sent on a different
+        # thread than the Triton response-complete thread, and both threads
+        # share access to the same object. This test sends sufficient load to
+        # the endpoint in an attempt to expose a race condition, if any exists.
+        input1 = {"PROMPT": "hello", "STREAM": False, "param": "segfault"}
+        input2 = {
+            "PROMPT": "hello",
+            "STREAM": True,
+            "REPETITION": 3,
+            "param": "segfault",
+        }
+        threads = []
+
+        def thread_func(model_name, inputs):
+            self.generate_stream(model_name, inputs).raise_for_status()
+
+        for _ in range(50):
+            threads.append(
+                threading.Thread(target=thread_func, args=((self._model_name, input1)))
+            )
+            threads.append(
+                threading.Thread(target=thread_func, args=((self._model_name, input2)))
+            )
+        for thread in threads:
+            thread.start()
+        for thread in threads:
+            thread.join()
+
+    def test_one_response(self):
+        # With the current 'inputs' the model sends at least one response, and
+        # "STREAM" controls how the model delivers its responses:
+        # If True, the model sends two responses: the actual infer response and
+        # a second one carrying only the flag that signals the end of responses.
+        # The 'generate_stream' endpoint is designed for this case, so it should
+        # send the infer response and complete the HTTP response appropriately.
+        # The 'generate' endpoint can also handle this case because, at its
+        # core, only one infer response is received, the same as typical HTTP
+        # usage.
+        # If False, the model sends one response containing both the infer
+        # response and the end flag, the same as a non-decoupled model.
+        inputs = {"PROMPT": "hello world", "STREAM": True}
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+        r = self.generate(self._model_name, inputs)
+        r.raise_for_status()
+
+        inputs["STREAM"] = False
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+        r = self.generate(self._model_name, inputs)
+        r.raise_for_status()
+
+    def test_zero_response(self):
+        inputs = {"PROMPT": "hello world", "STREAM": True, "REPETITION": 0}
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+        # Expect generate fails the inference
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "generate expects model to produce exactly 1 response",
+                r.json()["error"],
+            )
+
+    def test_many_response(self):
+        inputs = {"PROMPT": "hello world", "STREAM": True, "REPETITION": 2}
+        r = self.generate_stream(self._model_name, inputs)
+        r.raise_for_status()
+        # Expect generate fails the inference
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "generate expects model to produce exactly 1 response",
+                r.json()["error"],
+            )
+
+    def test_complex_schema(self):
+        # Currently only fundamental type conversion is supported; a nested
+        # object in the request will result in a parsing error.
+
+        # complex object to parameters (specifying non model input)
+        inputs = {
+            "PROMPT": "hello world",
+            "STREAM": True,
+            "PARAMS": {"PARAM_0": 0, "PARAM_1": True},
+        }
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn("parameter 'PARAMS' has invalid type", r.json()["error"])
+
+        # complex object to model input
+        inputs = {
+            "PROMPT": {"USER": "hello world", "BOT": "world hello"},
+            "STREAM": True,
+        }
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "attempt to access JSON non-string as string", r.json()["error"]
+            )
+
+    def test_close_connection_during_streaming(self):
+        # Set up a streaming request whose responses are generated slowly so
+        # the connection can be closed while responses are still in flight.
+        text = "hello world"
+        rep_count = 3
+        inputs = {"PROMPT": [text], "STREAM": True, "REPETITION": rep_count, "DELAY": 2}
+        res = self.generate_stream(self._model_name, inputs, stream=True)
+        # close connection while the responses are being generated
+        res.close()
+        # check that the server is still healthy
+        health_url = "http://localhost:8000/v2/health/live"
+        requests.get(health_url).raise_for_status()
+
+    def test_parameters(self):
+        # Test reserved nested object for parameters
+        text = "hello world"
+        rep_count = 3
+        inputs = {
+            "PROMPT": [text],
+            "STREAM": True,
+            "parameters": {"REPETITION": rep_count},
+        }
+        self.generate_stream_expect_success(self._model_name, inputs, text, rep_count)
+
+        # parameters keyword is not an object
+        inputs = {"PROMPT": [text], "STREAM": True, "parameters": 1}
+
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "Expected JSON object for keyword: 'parameters'", r.json()["error"]
+            )
+
+        # parameters contains complex object
+        inputs = {
+            "PROMPT": [text],
+            "STREAM": True,
+            "parameters": {"nested": {"twice": 1}},
+        }
+
+        r = self.generate(self._model_name, inputs)
+        try:
+            r.raise_for_status()
+        except requests.exceptions.HTTPError as e:
+            self.assertIn(
+                "Converting keyword: 'parameters': parameter 'nested' has invalid type.",
+                r.json()["error"],
+            )
+
+
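+# A minimal sketch (not exercised by the tests above) of how a caller might
+# consume the generate_stream endpoint outside of unittest, assuming the same
+# mock_llm model and localhost:8000 endpoint used throughout this file. The
+# helper name is hypothetical and purely illustrative.
+def _example_generate_stream_consumer(prompt="hello world", repetition=3):
+    inputs = {"PROMPT": [prompt], "STREAM": True, "REPETITION": repetition}
+    res = requests.post(
+        "http://localhost:8000/v2/models/mock_llm/generate_stream",
+        data=json.dumps(inputs),
+        headers={"Accept": "text/event-stream"},
+        stream=True,
+    )
+    res.raise_for_status()
+    # Each SSE event carries one JSON response from the model.
+    for event in sseclient.SSEClient(res).events():
+        yield json.loads(event.data)["TEXT"]
+
+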
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/generate_models/mock_llm/1/model.py b/qa/L0_http/generate_models/mock_llm/1/model.py
new file mode 100644
index 0000000000..9c5e9423e4
--- /dev/null
+++ b/qa/L0_http/generate_models/mock_llm/1/model.py
@@ -0,0 +1,107 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import json
+import time
+
+import numpy as np
+import triton_python_backend_utils as pb_utils
+
+
+class TritonPythonModel:
+    def initialize(self, args):
+        self.model_config = json.loads(args["model_config"])
+        self.decoupled = self.model_config.get("model_transaction_policy", {}).get(
+            "decoupled"
+        )
+
+    def execute(self, requests):
+        if self.decoupled:
+            return self.exec_decoupled(requests)
+        else:
+            return self.exec(requests)
+
+    def exec(self, requests):
+        responses = []
+        for request in requests:
+            params = json.loads(request.parameters())
+            rep_count = params["REPETITION"] if "REPETITION" in params else 1
+
+            input_np = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()
+            stream_np = pb_utils.get_input_tensor_by_name(request, "STREAM").as_numpy()
+            stream = stream_np.flatten()[0]
+            if stream:
+                responses.append(
+                    pb_utils.InferenceResponse(
+                        error=pb_utils.TritonError(
+                            "STREAM only supported in decoupled mode"
+                        )
+                    )
+                )
+            else:
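+                # Non-streaming path: return the prompt repeated rep_count
+                # times along the variable-sized output dimension in a single
+                # response.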
+                out_tensor = pb_utils.Tensor(
+                    "TEXT", np.repeat(input_np, rep_count, axis=1)
+                )
+                responses.append(pb_utils.InferenceResponse([out_tensor]))
+        return responses
+
+    def exec_decoupled(self, requests):
+        for request in requests:
+            params = json.loads(request.parameters())
+            rep_count = params["REPETITION"] if "REPETITION" in params else 1
+            fail_last = params["FAIL_LAST"] if "FAIL_LAST" in params else False
+            delay = params["DELAY"] if "DELAY" in params else None
+
+            sender = request.get_response_sender()
+            input_np = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()
+            stream_np = pb_utils.get_input_tensor_by_name(request, "STREAM").as_numpy()
+            out_tensor = pb_utils.Tensor("TEXT", input_np)
+            response = pb_utils.InferenceResponse([out_tensor])
+            # If stream enabled, just send multiple copies of response
+            # FIXME: Could split up response string into tokens, but this is simpler for now.
+            stream = stream_np.flatten()[0]
+            if stream:
+                for _ in range(rep_count):
+                    if delay is not None:
+                        time.sleep(delay)
+                    if not sender.is_cancelled():
+                        sender.send(response)
+                    else:
+                        break
+                sender.send(
+                    None
+                    if not fail_last
+                    else pb_utils.InferenceResponse(
+                        error=pb_utils.TritonError("An Error Occurred")
+                    ),
+                    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
+                )
+            # If stream disabled, just send one response
+            else:
+                sender.send(
+                    response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL
+                )
+        return None
diff --git a/qa/L0_http/generate_models/mock_llm/config.pbtxt b/qa/L0_http/generate_models/mock_llm/config.pbtxt
new file mode 100644
index 0000000000..6871661525
--- /dev/null
+++ b/qa/L0_http/generate_models/mock_llm/config.pbtxt
@@ -0,0 +1,60 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+backend: "python"
+
+max_batch_size: 0
+
+model_transaction_policy {
+  decoupled: True
+}
+
+input [
+  {
+    name: "PROMPT"
+    data_type: TYPE_STRING
+    dims: [ 1, 1 ]
+  },
+  {
+    name: "STREAM"
+    data_type: TYPE_BOOL
+    dims: [ 1, 1 ]
+  }
+]
+
+output [
+  {
+    name: "TEXT"
+    data_type: TYPE_STRING
+    dims: [ 1, -1 ]
+  }
+]
+
+instance_group [
+  {
+    count: 1
+    kind: KIND_MODEL
+  }
+]
diff --git a/qa/L0_http/http_basic_auth_test.py b/qa/L0_http/http_basic_auth_test.py
new file mode 100755
index 0000000000..5aa1f71d81
--- /dev/null
+++ b/qa/L0_http/http_basic_auth_test.py
@@ -0,0 +1,66 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+import sys
+import unittest
+
+sys.path.append("../common")
+
+import test_util as tu
+import tritonclient.http as tritonhttpclient
+import tritonclient.http.aio as asynctritonhttpclient
+from tritonclient.http.aio.auth import BasicAuth as AsyncBasicAuth
+from tritonclient.http.auth import BasicAuth
+
+
+class HTTPBasicAuthTest(tu.TestResultCollector):
+    def setUp(self):
+        # Use the nginx port
+        self._client = tritonhttpclient.InferenceServerClient(url="localhost:8004")
+        self._client.register_plugin(BasicAuth("username", "password"))
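+        # BasicAuth is expected to attach an HTTP basic 'Authorization' header
+        # built from these credentials to every request; the nginx proxy on
+        # port 8004 (see nginx.conf) is assumed to require it.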
+
+    def test_client_call(self):
+        self.assertTrue(self._client.is_server_live())
+
+    def tearDown(self):
+        self._client.close()
+
+
+class HTTPBasicAuthAsyncTest(unittest.IsolatedAsyncioTestCase):
+    async def asyncSetUp(self):
+        # Use the nginx port
+        self._client = asynctritonhttpclient.InferenceServerClient(url="localhost:8004")
+        self._client.register_plugin(AsyncBasicAuth("username", "password"))
+
+    async def test_client_call(self):
+        self.assertTrue(await self._client.is_server_live())
+
+    async def asyncTearDown(self):
+        await self._client.close()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/http_client_plugin_test.py b/qa/L0_http/http_client_plugin_test.py
new file mode 100755
index 0000000000..963ea2a81b
--- /dev/null
+++ b/qa/L0_http/http_client_plugin_test.py
@@ -0,0 +1,175 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import unittest
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import numpy as np
+import test_util as tu
+import tritonclient.http as tritonhttpclient
+import tritonclient.http.aio as asynctritonhttpclient
+from tritonclient.http import InferenceServerClientPlugin
+from tritonclient.utils import np_to_triton_dtype
+
+
+# A simple plugin that adds headers to the inference request.
+class TestPlugin(InferenceServerClientPlugin):
+    def __init__(self, headers):
+        self._headers = headers
+
+    def __call__(self, request):
+        request.headers.update(self._headers)
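+        # The client invokes this hook for every request before it is sent, so
+        # the injected headers should show up on each call asserted below.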
+
+
+class HTTPClientPluginAsyncTest(unittest.IsolatedAsyncioTestCase):
+    async def asyncSetUp(self):
+        self._headers = {"MY-KEY": "MY-VALUE"}
+        self._plugin = TestPlugin(self._headers)
+        self._client = asynctritonhttpclient.InferenceServerClient(url="localhost:8001")
+
+    async def test_server_is_live(self):
+        # We are testing is_server_live as an example API that uses the GET
+        # method to communicate with the server.
+        self._client._stub.get = AsyncMock()
+
+        self._client.register_plugin(self._plugin)
+        self.assertEqual(self._plugin, self._client.plugin())
+        await self._client.is_server_live()
+        self._client._stub.get.assert_awaited_with(
+            url=unittest.mock.ANY, headers=self._headers
+        )
+
+        # Make sure unregistering the plugin would no longer add the headers
+        self._client.unregister_plugin()
+        self.assertEqual(None, self._client.plugin())
+        await self._client.is_server_live()
+        self._client._stub.get.assert_awaited_with(url=unittest.mock.ANY, headers={})
+
+    async def test_simple_infer(self):
+        # Only the read function is awaited, so only it needs to be an AsyncMock.
+        post_return = MagicMock()
+        post_return.read = AsyncMock()
+        self._client._stub.post = AsyncMock(return_value=post_return)
+
+        np_input = np.arange(8, dtype=np.float32).reshape(1, -1)
+        model = "onnx_zero_1_float32"
+
+        # Setup inputs
+        inputs = []
+        inputs.append(
+            tritonhttpclient.InferInput(
+                "INPUT0", np_input.shape, np_to_triton_dtype(np_input.dtype)
+            )
+        )
+
+        # Set binary_data to False so that 'Inference-Header-Content-Length' is
+        # not added to the headers.
+        inputs[0].set_data_from_numpy(np_input, binary_data=False)
+
+        async def run_infer(headers):
+            with patch("tritonclient.http.aio._raise_if_error"):
+                with patch("tritonclient.http.aio.InferResult"):
+                    await self._client.infer(model_name=model, inputs=inputs)
+                    self._client._stub.post.assert_awaited_with(
+                        url=unittest.mock.ANY, data=unittest.mock.ANY, headers=headers
+                    )
+
+        self._client.register_plugin(self._plugin)
+        await run_infer(self._headers)
+
+        self._client.unregister_plugin()
+        await run_infer({})
+
+    async def asyncTearDown(self):
+        await self._client.close()
+
+
+class HTTPClientPluginTest(tu.TestResultCollector):
+    def setUp(self):
+        self._headers = {"MY-KEY": "MY-VALUE"}
+        self._plugin = TestPlugin(self._headers)
+        self._client = tritonhttpclient.InferenceServerClient(url="localhost:8001")
+
+        # Use magic mock for the client stub
+        self._client._client_stub = MagicMock()
+
+    def test_server_is_live(self):
+        # We are testing is_server_live as an example API that uses the GET
+        # method to communicate with the server.
+        self._client.register_plugin(self._plugin)
+        self._client.is_server_live()
+        self._client._client_stub.get.assert_called_with(
+            unittest.mock.ANY, headers=self._headers
+        )
+
+        # Make sure unregistering the plugin would no longer add the headers
+        self._client.unregister_plugin()
+        self._client.is_server_live()
+        self._client._client_stub.get.assert_called_with(unittest.mock.ANY, headers={})
+
+    def test_simple_infer(self):
+        np_input = np.arange(8, dtype=np.float32).reshape(1, -1)
+        model = "onnx_zero_1_float32"
+
+        # Setup inputs
+        inputs = []
+        inputs.append(
+            tritonhttpclient.InferInput(
+                "INPUT0", np_input.shape, np_to_triton_dtype(np_input.dtype)
+            )
+        )
+
+        # Set binary_data to False so that 'Inference-Header-Content-Length' is
+        # not added to the headers.
+        inputs[0].set_data_from_numpy(np_input, binary_data=False)
+
+        def run_infer(headers):
+            with patch("tritonclient.http._client._raise_if_error"):
+                with patch("tritonclient.http._client.InferResult"):
+                    self._client.infer(model_name=model, inputs=inputs)
+                    self._client._client_stub.post.assert_called_with(
+                        request_uri=unittest.mock.ANY,
+                        body=unittest.mock.ANY,
+                        headers=headers,
+                    )
+
+        self._client.register_plugin(self._plugin)
+        run_infer(self._headers)
+
+        self._client.unregister_plugin()
+        run_infer({})
+
+    def tearDown(self):
+        self._client.close()
+
+
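+# A minimal sketch (not run by the tests above) of how the plugin could be used
+# against a live server, assuming a Triton HTTP endpoint at localhost:8001 (the
+# URL used by the mocked tests above). The helper name is hypothetical.
+def _example_plugin_usage():
+    client = tritonhttpclient.InferenceServerClient(url="localhost:8001")
+    client.register_plugin(TestPlugin({"MY-KEY": "MY-VALUE"}))
+    # Every request made through this client now carries the extra header.
+    live = client.is_server_live()
+    client.unregister_plugin()
+    client.close()
+    return live
+
+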
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/http_restricted_api_test.py b/qa/L0_http/http_restricted_api_test.py
new file mode 100755
index 0000000000..e5e3d5fd2d
--- /dev/null
+++ b/qa/L0_http/http_restricted_api_test.py
@@ -0,0 +1,94 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import unittest
+
+import numpy as np
+import tritonclient.http as tritonhttpclient
+from tritonclient.utils import InferenceServerException
+
+
+class RestrictedAPITest(unittest.TestCase):
+    def setUp(self):
+        self.model_name_ = "simple"
+        self.client_ = tritonhttpclient.InferenceServerClient("localhost:8000")
+
+    # Other unspecified APIs should not be restricted
+    def test_sanity(self):
+        self.client_.get_inference_statistics("simple")
+        self.client_.get_inference_statistics(
+            "simple", headers={"infer-key": "infer-value"}
+        )
+
+    # The metadata, infer, and model repository APIs are restricted.
+    # Metadata and infer expect an "infer-key : infer-value" header,
+    # model repository expects "admin-key : admin-value".
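+    # The restriction policy itself is assumed to be configured on the
+    # tritonserver command line by test.sh; these tests only verify the
+    # client-facing behavior.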
+    def test_model_repository(self):
+        with self.assertRaisesRegex(InferenceServerException, "This API is restricted"):
+            self.client_.unload_model(
+                self.model_name_, headers={"infer-key": "infer-value"}
+            )
+        # Request goes through and hits the actual transaction error
+        with self.assertRaisesRegex(
+            InferenceServerException, "explicit model load / unload is not allowed"
+        ):
+            self.client_.unload_model(
+                self.model_name_, headers={"admin-key": "admin-value"}
+            )
+
+    def test_metadata(self):
+        with self.assertRaisesRegex(InferenceServerException, "This API is restricted"):
+            self.client_.get_server_metadata()
+        self.client_.get_server_metadata({"infer-key": "infer-value"})
+
+    def test_infer(self):
+        # setup
+        inputs = [
+            tritonhttpclient.InferInput("INPUT0", [1, 16], "INT32"),
+            tritonhttpclient.InferInput("INPUT1", [1, 16], "INT32"),
+        ]
+        inputs[0].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.ones(shape=(1, 16), dtype=np.int32))
+
+        # This test only cares whether the request goes through
+        with self.assertRaisesRegex(InferenceServerException, "This API is restricted"):
+            _ = self.client_.infer(
+                model_name=self.model_name_, inputs=inputs, headers={"test": "1"}
+            )
+        self.client_.infer(
+            model_name=self.model_name_,
+            inputs=inputs,
+            headers={"infer-key": "infer-value"},
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/http_test.py b/qa/L0_http/http_test.py
old mode 100644
new mode 100755
index 2a5e3c141e..1f292ffb88
--- a/qa/L0_http/http_test.py
+++ b/qa/L0_http/http_test.py
@@ -1,5 +1,5 @@
 #!/usr/bin/python
-# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -29,40 +29,39 @@
 
 sys.path.append("../common")
 
-import requests
 import unittest
+
 import numpy as np
+import requests
 import test_util as tu
 import tritonclient.http as tritonhttpclient
-from tritonclient.utils import np_to_triton_dtype, InferenceServerException
+from tritonclient.utils import InferenceServerException, np_to_triton_dtype
 
 
 class HttpTest(tu.TestResultCollector):
-
     def _get_infer_url(self, model_name):
         return "http://localhost:8000/v2/models/{}/infer".format(model_name)
 
-    def _raw_binary_helper(self,
-                           model,
-                           input_bytes,
-                           expected_output_bytes,
-                           extra_headers={}):
+    def _raw_binary_helper(
+        self, model, input_bytes, expected_output_bytes, extra_headers={}
+    ):
         # Select model that satisfies constraints for raw binary request
-        headers = {'Inference-Header-Content-Length': '0'}
+        headers = {"Inference-Header-Content-Length": "0"}
         # Add extra headers (if any) before sending request
         headers.update(extra_headers)
-        r = requests.post(self._get_infer_url(model),
-                          data=input_bytes,
-                          headers=headers)
+        r = requests.post(self._get_infer_url(model), data=input_bytes, headers=headers)
         r.raise_for_status()
 
         # Get the inference header size so we can locate the output binary data
         header_size = int(r.headers["Inference-Header-Content-Length"])
         # Assert input == output since this tests an identity model
         self.assertEqual(
-            expected_output_bytes, r.content[header_size:],
-            "Expected response body contains correct output binary data: {}; got: {}"
-            .format(expected_output_bytes, r.content[header_size:]))
+            expected_output_bytes,
+            r.content[header_size:],
+            "Expected response body contains correct output binary data: {}; got: {}".format(
+                expected_output_bytes, r.content[header_size:]
+            ),
+        )
 
     def test_raw_binary(self):
         model = "onnx_zero_1_float32"
@@ -80,54 +79,61 @@ def test_byte(self):
         # i.e. BYTE type the element count must be 1
         model = "onnx_zero_1_object_1_element"
         input = "427"
-        headers = {'Inference-Header-Content-Length': '0'}
-        r = requests.post(self._get_infer_url(model),
-                          data=input,
-                          headers=headers)
+        headers = {"Inference-Header-Content-Length": "0"}
+        r = requests.post(self._get_infer_url(model), data=input, headers=headers)
         r.raise_for_status()
 
         # Get the inference header size so we can locate the output binary data
         header_size = int(r.headers["Inference-Header-Content-Length"])
         # Triton returns BYTES tensor with byte size prepended
-        output = r.content[header_size + 4:].decode()
+        output = r.content[header_size + 4 :].decode()
         self.assertEqual(
-            input, output,
-            "Expected response body contains correct output binary data: {}; got: {}"
-            .format(input, output))
+            input,
+            output,
+            "Expected response body contains correct output binary data: {}; got: {}".format(
+                input, output
+            ),
+        )
 
     def test_byte_too_many_elements(self):
         # Select model that doesn't satisfy constraints for raw binary request
         # i.e. BYTE type the element count must be 1
         model = "onnx_zero_1_object"
         input = "427"
-        headers = {'Inference-Header-Content-Length': '0'}
-        r = requests.post(self._get_infer_url(model),
-                          data=input,
-                          headers=headers)
+        headers = {"Inference-Header-Content-Length": "0"}
+        r = requests.post(self._get_infer_url(model), data=input, headers=headers)
         self.assertEqual(
-            400, r.status_code,
+            400,
+            r.status_code,
             "Expected error code {} returned for the request; got: {}".format(
-                400, r.status_code))
+                400, r.status_code
+            ),
+        )
         self.assertIn(
-            "For BYTE datatype raw input, the model must have input shape [1]",
-            r.content.decode())
+            "For BYTE datatype raw input 'INPUT0', the model must have input shape [1]",
+            r.content.decode(),
+        )
 
     def test_multi_variable_dimensions(self):
         # Select model that doesn't satisfy constraints for raw binary request
         # i.e. this model has multiple variable-sized dimensions
         model = "onnx_zero_1_float16"
         input = np.ones([2, 2], dtype=np.float16)
-        headers = {'Inference-Header-Content-Length': '0'}
-        r = requests.post(self._get_infer_url(model),
-                          data=input.tobytes(),
-                          headers=headers)
+        headers = {"Inference-Header-Content-Length": "0"}
+        r = requests.post(
+            self._get_infer_url(model), data=input.tobytes(), headers=headers
+        )
         self.assertEqual(
-            400, r.status_code,
+            400,
+            r.status_code,
             "Expected error code {} returned for the request; got: {}".format(
-                400, r.status_code))
+                400, r.status_code
+            ),
+        )
         self.assertIn(
             "The shape of the raw input 'INPUT0' can not be deduced because there are more than one variable-sized dimension",
-            r.content.decode())
+            r.content.decode(),
+        )
 
     def test_multi_inputs(self):
         # Select model that doesn't satisfy constraints for raw binary request
@@ -136,21 +142,25 @@ def test_multi_inputs(self):
         # Use one numpy array, after tobytes() it can be seen as three inputs
         # each with 8 elements (this ambiguity is why this is not allowed)
         input = np.arange(24, dtype=np.float32)
-        headers = {'Inference-Header-Content-Length': '0'}
-        r = requests.post(self._get_infer_url(model),
-                          data=input.tobytes(),
-                          headers=headers)
+        headers = {"Inference-Header-Content-Length": "0"}
+        r = requests.post(
+            self._get_infer_url(model), data=input.tobytes(), headers=headers
+        )
         self.assertEqual(
-            400, r.status_code,
+            400,
+            r.status_code,
             "Expected error code {} returned for the request; got: {}".format(
-                400, r.status_code))
+                400, r.status_code
+            ),
+        )
         self.assertIn(
             "Raw request must only have 1 input (found 1) to be deduced but got 3 inputs in",
-            r.content.decode())
+            r.content.decode(),
+        )
 
     # This is to test that a properly chunk-encoded request by the caller works,
     # though Triton does not specifically do any special chunk handling outside
-    # of underlying HTTP libaries used
+    # of underlying HTTP libraries used
     # Future Enhancement: Test other encodings as they come up
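+    # For reference on the manual framing below: a chunked body for b"ABCD"
+    # is sent as b"4\r\nABCD\r\n" (hex length, CRLF, data, CRLF) followed by
+    # the terminating b"0\r\n\r\n" chunk.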
     def test_content_encoding_chunked_manually(self):
         # Similar to test_raw_binary but test with extra headers
@@ -165,9 +175,8 @@ def test_content_encoding_chunked_manually(self):
         # Chunk bytes and line separator
         chunk_encoded_input += input_bytes + b"\r\n"
         # Final byte (0) and end message
-        chunk_encoded_input += b'0\r\n\r\n'
-        self._raw_binary_helper(model, chunk_encoded_input, input_bytes,
-                                extra_headers)
+        chunk_encoded_input += b"0\r\n\r\n"
+        self._raw_binary_helper(model, chunk_encoded_input, input_bytes, extra_headers)
 
     # Test that Python client rejects any "Transfer-Encoding" HTTP headers
     # as we don't specially handle encoding requests for the user through
@@ -183,17 +192,19 @@ def test_content_encoding_unsupported_client(self):
                 inputs = []
                 inputs.append(
                     tritonhttpclient.InferInput(
-                        'INPUT0', np_input.shape,
-                        np_to_triton_dtype(np_input.dtype)))
+                        "INPUT0", np_input.shape, np_to_triton_dtype(np_input.dtype)
+                    )
+                )
                 inputs[0].set_data_from_numpy(np_input)
 
-                with tritonhttpclient.InferenceServerClient(
-                        "localhost:8000") as client:
+                with tritonhttpclient.InferenceServerClient("localhost:8000") as client:
                     # Python client is expected to raise an exception to reject
                     # 'content-encoding' HTTP headers.
-                    with self.assertRaisesRegex(InferenceServerException,
-                                                "Unsupported HTTP header"):
+                    with self.assertRaisesRegex(
+                        InferenceServerException, "Unsupported HTTP header"
+                    ):
                         client.infer(model_name=model, inputs=inputs, headers=headers)
 
-if __name__ == '__main__':
+
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_http/nginx.conf b/qa/L0_http/nginx.conf
new file mode 100644
index 0000000000..fb62ca719c
--- /dev/null
+++ b/qa/L0_http/nginx.conf
@@ -0,0 +1,57 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+worker_processes  1;
+
+error_log  /var/log/nginx/error.log;
+
+events {
+    worker_connections  1024;
+}
+
+http {
+    # Configure basic authentication
+    auth_basic "Restricted Content";
+    auth_basic_user_file /opt/tritonserver/qa/L0_http/pswd;
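+    # The pswd file holds htpasswd-style entries (e.g. "username:$apr1$..."),
+    # generated in test.sh via "openssl passwd -apr1".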
+
+    # Define upstream server
+    upstream backend {
+        server localhost:8000;
+    }
+
+    # Define server block for reverse proxy
+    server {
+        listen 8004;
+
+        # Configure location for reverse proxy
+        location / {
+            proxy_pass http://backend;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        }
+    }
+}
diff --git a/qa/L0_http/python_http_aio_test.py b/qa/L0_http/python_http_aio_test.py
new file mode 100755
index 0000000000..bd8d342bb1
--- /dev/null
+++ b/qa/L0_http/python_http_aio_test.py
@@ -0,0 +1,116 @@
+#!/usr/bin/env python
+# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import unittest
+
+import tritonclient.http.aio as httpclient
+from tritonclient.utils import *
+
+
+class TestHttpAioClient(unittest.IsolatedAsyncioTestCase):
+    """Test if aio rpc can reach the server"""
+
+    async def asyncSetUp(self):
+        self._triton_client = httpclient.InferenceServerClient(url="localhost:8000")
+
+    async def asyncTearDown(self):
+        await self._triton_client.close()
+
+    async def test_is_server_live(self):
+        ret = await self._triton_client.is_server_live()
+        self.assertEqual(ret, True)
+
+    async def test_is_server_ready(self):
+        ret = await self._triton_client.is_server_ready()
+        self.assertEqual(ret, True)
+
+    async def test_is_model_ready(self):
+        ret = await self._triton_client.is_model_ready("simple")
+        self.assertEqual(ret, True)
+
+    async def test_get_server_metadata(self):
+        ret = await self._triton_client.get_server_metadata()
+        self.assertEqual(ret["name"], "triton")
+
+    async def test_get_model_metadata(self):
+        ret = await self._triton_client.get_model_metadata("simple")
+        self.assertEqual(ret["name"], "simple")
+
+    async def test_get_model_config(self):
+        ret = await self._triton_client.get_model_config("simple")
+        self.assertEqual(ret["name"], "simple")
+
+    async def test_get_model_repository_index(self):
+        ret = await self._triton_client.get_model_repository_index()
+        self.assertEqual(len(ret), 7)
+
+    async def test_load_model(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "explicit model load / unload is not allowed if polling is enabled",
+        ):
+            await self._triton_client.load_model("simple")
+
+    async def test_unload_model(self):
+        with self.assertRaisesRegex(
+            InferenceServerException,
+            "explicit model load / unload is not allowed if polling is enabled",
+        ):
+            await self._triton_client.unload_model("simple")

+
+    async def test_get_inference_statistics(self):
+        await self._triton_client.get_inference_statistics()
+
+    async def test_update_trace_settings(self):
+        await self._triton_client.update_trace_settings()
+
+    async def test_get_trace_settings(self):
+        await self._triton_client.get_trace_settings()
+
+    async def test_get_system_shared_memory_status(self):
+        await self._triton_client.get_system_shared_memory_status()
+
+    async def test_register_system_shared_memory(self):
+        with self.assertRaisesRegex(InferenceServerException, ""):
+            await self._triton_client.register_system_shared_memory("", "", 0)
+
+    async def test_unregister_system_shared_memory(self):
+        await self._triton_client.unregister_system_shared_memory()
+
+    async def test_get_cuda_shared_memory_status(self):
+        await self._triton_client.get_cuda_shared_memory_status()
+
+    async def test_register_cuda_shared_memory(self):
+        with self.assertRaisesRegex(InferenceServerException, ""):
+            await self._triton_client.register_cuda_shared_memory("", b"", 0, 0)
+
+    async def test_unregister_cuda_shared_memory(self):
+        await self._triton_client.unregister_cuda_shared_memory()
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_http/test.sh b/qa/L0_http/test.sh
old mode 100644
new mode 100755
index 9b94d1ea2a..7ba219fe15
--- a/qa/L0_http/test.sh
+++ b/qa/L0_http/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -42,6 +42,10 @@ export CUDA_VISIBLE_DEVICES=0
 
 RET=0
 
+CLIENT_PLUGIN_TEST="./http_client_plugin_test.py"
+BASIC_AUTH_TEST="./http_basic_auth_test.py"
+RESTRICTED_API_TEST="./http_restricted_api_test.py"
+NGINX_CONF="./nginx.conf"
 # On windows the paths invoked by the script (running in WSL) must use
 # /mnt/c when needed but the paths on the tritonserver command-line
 # must be C:/ style.
@@ -52,6 +56,7 @@ if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then
     BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends}
     SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe}
 
+    SIMPLE_AIO_INFER_CLIENT_PY=${SDKDIR}/python/simple_http_aio_infer_client.py
     SIMPLE_HEALTH_CLIENT_PY=${SDKDIR}/python/simple_http_health_metadata.py
     SIMPLE_INFER_CLIENT_PY=${SDKDIR}/python/simple_http_infer_client.py
     SIMPLE_ASYNC_INFER_CLIENT_PY=${SDKDIR}/python/simple_http_async_infer_client.py
@@ -83,6 +88,7 @@ else
     SERVER=${TRITON_DIR}/bin/tritonserver
     BACKEND_DIR=${TRITON_DIR}/backends
 
+    SIMPLE_AIO_INFER_CLIENT_PY=../clients/simple_http_aio_infer_client.py
     SIMPLE_HEALTH_CLIENT_PY=../clients/simple_http_health_metadata.py
     SIMPLE_INFER_CLIENT_PY=../clients/simple_http_infer_client.py
     SIMPLE_ASYNC_INFER_CLIENT_PY=../clients/simple_http_async_infer_client.py
@@ -143,6 +149,7 @@ fi
 
 IMAGE=../images/vulture.jpeg
 for i in \
+        $SIMPLE_AIO_INFER_CLIENT_PY \
         $SIMPLE_INFER_CLIENT_PY \
         $SIMPLE_ASYNC_INFER_CLIENT_PY \
         $SIMPLE_IMAGE_CLIENT_PY \
@@ -223,6 +230,13 @@ for i in \
     fi
 done
 
+# Test with json input and output data
+$SIMPLE_STRING_INFER_CLIENT --json-input-data --json-output-data >> ${CLIENT_LOG}.c++.json 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.c++.json
+    RET=1
+fi
+
 # Test while reusing the InferInput and InferRequestedOutput objects
 $SIMPLE_REUSE_INFER_OBJECTS_CLIENT -v >> ${CLIENT_LOG}.c++.reuse 2>&1
 if [ $? -ne 0 ]; then
@@ -230,6 +244,24 @@ if [ $? -ne 0 ]; then
     RET=1
 fi
 
+python $CLIENT_PLUGIN_TEST >> ${CLIENT_LOG}.python.plugin 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.python.plugin
+    RET=1
+fi
+
+# Create a password file with username:password
+echo -n 'username:' > pswd
+echo "password" | openssl passwd -stdin -apr1 >> pswd
+nginx -c `pwd`/$NGINX_CONF
+
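+# The basic auth test exercises Triton through the nginx reverse proxy
+# configured above, which listens on port 8004 (see nginx.conf).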
+python $BASIC_AUTH_TEST >> ${CLIENT_LOG}.python.plugin.auth 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.python.plugin.auth
+    RET=1
+fi
+service nginx stop
+
 # Test with the base path in url.
 $SIMPLE_INFER_CLIENT -u localhost:8000/base_path -v >> ${CLIENT_LOG}.c++.base_path_url 2>&1
 if [ $? -eq 0 ]; then
@@ -268,6 +300,10 @@ if [ $(cat ${CLIENT_LOG}.model_control | grep "PASS" | wc -l) -ne 1 ]; then
     cat ${CLIENT_LOG}.model_control
     RET=1
 fi
+if [ $(cat ${SERVER_LOG} | grep "Invalid config override" | wc -l) -eq 0 ]; then
+    cat ${SERVER_LOG}
+    RET=1
+fi
 
 set -e
 
@@ -469,7 +505,7 @@ wait $SERVER_PID
 # Run cpp client unit test
 rm -rf unit_test_models && mkdir unit_test_models
 cp -r $DATADIR/qa_model_repository/onnx_int32_int32_int32 unit_test_models/.
-cp -r ${MODELDIR}/simple unit_test_models/. 
+cp -r ${MODELDIR}/simple unit_test_models/.
 
 SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=unit_test_models
             --trace-file=global_unittest.log --trace-level=TIMESTAMPS --trace-rate=1"
@@ -507,32 +543,61 @@ SERVER_ARGS="--model-repository=`pwd`/unit_test_models \
              --strict-model-config=false"
 SERVER_LOG="./inference_server_cc_unit_test.load.log"
 CLIENT_LOG="./cc_unit_test.load.log"
+
+for i in \
+   "LoadWithFileOverride" \
+   "LoadWithConfigOverride" \
+   ; do
+    run_server
+    if [ "$SERVER_PID" == "0" ]; then
+        echo -e "\n***\n*** Failed to start $SERVER\n***"
+        cat $SERVER_LOG
+        exit 1
+    fi
+
+    set +e
+    $CC_UNIT_TEST --gtest_filter=HTTP*$i >> ${CLIENT_LOG}.$i 2>&1
+    if [ $? -ne 0 ]; then
+        cat ${CLIENT_LOG}.$i
+        RET=1
+    fi
+    set -e
+
+    kill $SERVER_PID
+    wait $SERVER_PID
+done
+
+# Run python http aio unit test
+PYTHON_HTTP_AIO_TEST=python_http_aio_test.py
+CLIENT_LOG=`pwd`/python_http_aio_test.log
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR}"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
     cat $SERVER_LOG
     exit 1
 fi
-
 set +e
-$CC_UNIT_TEST --gtest_filter=HTTP*Load* >> ${CLIENT_LOG} 2>&1
+python $PYTHON_HTTP_AIO_TEST > $CLIENT_LOG 2>&1
 if [ $? -ne 0 ]; then
-    cat ${CLIENT_LOG}
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Python HTTP AsyncIO Test Failed\n***"
     RET=1
 fi
 set -e
-
 kill $SERVER_PID
 wait $SERVER_PID
 
 # Run python unit test
-rm -r ${MODELDIR}/*
+MODELDIR=python_unit_test_models
+mkdir -p $MODELDIR
+rm -rf ${MODELDIR}/*
 cp -r $DATADIR/qa_identity_model_repository/onnx_zero_1_float32 ${MODELDIR}/.
 cp -r $DATADIR/qa_identity_model_repository/onnx_zero_1_object ${MODELDIR}/.
 cp -r $DATADIR/qa_identity_model_repository/onnx_zero_1_float16 ${MODELDIR}/.
 cp -r $DATADIR/qa_identity_model_repository/onnx_zero_3_float32 ${MODELDIR}/.
 cp -r ${MODELDIR}/onnx_zero_1_object ${MODELDIR}/onnx_zero_1_object_1_element && \
-    (cd models/onnx_zero_1_object_1_element && \
+    (cd $MODELDIR/onnx_zero_1_object_1_element && \
         sed -i "s/onnx_zero_1_object/onnx_zero_1_object_1_element/" config.pbtxt && \
         sed -i "0,/-1/{s/-1/1/}" config.pbtxt)
 
@@ -550,7 +615,45 @@ TEST_RESULT_FILE='test_results.txt'
 PYTHON_TEST=http_test.py
 EXPECTED_NUM_TESTS=8
 set +e
-python3 $PYTHON_TEST >$CLIENT_LOG 2>&1
+python $PYTHON_TEST >$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+### LLM / Generate REST API Endpoint Tests ###
+
+# Helper library to parse SSE events
+# https://github.com/mpetazzoni/sseclient
+pip install sseclient-py
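+# Each SSE event arrives as one or more "data: <payload>" lines terminated by
+# a blank line, which sseclient-py parses into discrete events.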
+
+SERVER_ARGS="--model-repository=`pwd`/generate_models"
+SERVER_LOG="./inference_server_generate_endpoint_test.log"
+CLIENT_LOG="./generate_endpoint_test.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+## Python Unit Tests
+TEST_RESULT_FILE='test_results.txt'
+PYTHON_TEST=generate_endpoint_test.py
+EXPECTED_NUM_TESTS=14
+set +e
+python $PYTHON_TEST >$CLIENT_LOG 2>&1
 if [ $? -ne 0 ]; then
     cat $CLIENT_LOG
     RET=1
@@ -567,6 +670,74 @@ set -e
 kill $SERVER_PID
 wait $SERVER_PID
 
+### Test Restricted APIs ###
+### Repeated API not allowed
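+### (Flag format is --http-restricted-api=<api>[,<api>...]:<key>=<value>;
+### presumably requests to a restricted API must supply the matching header.)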
+
+MODELDIR="`pwd`/models"
+SERVER_ARGS="--model-repository=${MODELDIR}
+             --http-restricted-api=model-repository,health:k1=v1 \
+             --http-restricted-api=metadata,health:k2=v2"
+SERVER_LOG="./http_restricted_endpoint_test.log"
+CLIENT_LOG="./http_restricted_endpoint_test.log"
+run_server
+EXPECTED_MSG="api 'health' can not be specified in multiple config groups"
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Expect fail to start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    RET=1
+elif [ `grep -c "${EXPECTED_MSG}" ${SERVER_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected ${EXPECTED_MSG} to be found in log\n***"
+    cat $SERVER_LOG
+    RET=1
+fi
+
+### Test Unknown Restricted API ###
+### Unknown API not allowed
+
+MODELDIR="`pwd`/models"
+SERVER_ARGS="--model-repository=${MODELDIR}
+             --http-restricted-api=model-reposit,health:k1=v1 \
+             --http-restricted-api=metadata,health:k2=v2"
+run_server
+EXPECTED_MSG="unknown restricted api 'model-reposit'"
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Expect fail to start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    RET=1
+elif [ `grep -c "${EXPECTED_MSG}" ${SERVER_LOG}` != "1" ]; then
+    echo -e "\n***\n*** Failed. Expected ${EXPECTED_MSG} to be found in log\n***"
+    cat $SERVER_LOG
+    RET=1
+fi
+
+### Test Restricted APIs ###
+### Restricted model-repository, metadata, and inference
+
+SERVER_ARGS="--model-repository=${MODELDIR} \
+             --http-restricted-api=model-repository:admin-key=admin-value \
+             --http-restricted-api=inference,metadata:infer-key=infer-value"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+set +e
+
+python $RESTRICTED_API_TEST RestrictedAPITest > $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Python HTTP Restricted Protocol Test Failed\n***"
+    RET=1
+fi
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+###
+
 if [ $RET -eq 0 ]; then
     echo -e "\n***\n*** Test Passed\n***"
 else
diff --git a/qa/L0_http_fuzz/fuzztest.py b/qa/L0_http_fuzz/fuzztest.py
old mode 100644
new mode 100755
index b8a52a4b2f..8e84ffffc7
--- a/qa/L0_http_fuzz/fuzztest.py
+++ b/qa/L0_http_fuzz/fuzztest.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,30 +27,32 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
+import glob
+import os
+import sqlite3
 import unittest
+
 import test_util as tu
-import sqlite3
 from boofuzz import *
-import glob
-import os
 
 
 class FuzzTest(tu.TestResultCollector):
-
     def _run_fuzz(self, url, logger):
         session = Session(
             target=Target(connection=TCPSocketConnection("127.0.0.1", 8000)),
             fuzz_loggers=logger,
-            keep_web_open=False)
+            keep_web_open=False,
+        )
 
         s_initialize(name="Request" + url)
         with s_block("Request-Line"):
-            s_group("Method", [
-                "GET", "HEAD", "POST", "PUT", "DELETE", "CONNECT", "OPTIONS",
-                "TRACE"
-            ])
+            s_group(
+                "Method",
+                ["GET", "HEAD", "POST", "PUT", "DELETE", "CONNECT", "OPTIONS", "TRACE"],
+            )
             s_delim(" ", name="space-1")
             s_string(url, name="Request-URI")
             s_delim(" ", name="space-2")
@@ -61,28 +65,36 @@ def _run_fuzz(self, url, logger):
 
     def test_failures_from_db(self):
         url_list = [
-            "/v2", "/v2/models/simple", "/v2/models/simple/infer",
-            "/v2/models/simple/versions/v1", "/v2/models/simple/config",
-            "/v2/models/simple/stats", "/v2/models/simple/ready",
-            "/v2/health/ready", "/v2/health/live", "/v2/repository/index",
+            "/v2",
+            "/v2/models/simple",
+            "/v2/models/simple/infer",
+            "/v2/models/simple/versions/v1",
+            "/v2/models/simple/config",
+            "/v2/models/simple/stats",
+            "/v2/models/simple/ready",
+            "/v2/health/ready",
+            "/v2/health/live",
+            "/v2/repository/index",
             "/v2/repository/models/simple/unload",
             "/v2/repository/models/simple/load",
-            "/v2/systemsharedmemory/status", "/v2/systemsharedmemory/register",
+            "/v2/systemsharedmemory/status",
+            "/v2/systemsharedmemory/register",
             "/v2/systemsharedmemory/unregister",
             "/v2/systemsharedmemory/region/xx/status",
-            "/v2/cudasharedmemory/status", "/v2/cudasharedmemory/register",
+            "/v2/cudasharedmemory/status",
+            "/v2/cudasharedmemory/register",
             "/v2/cudasharedmemory/unregister",
-            "/v2/cudasharedmemory/region/xx/status"
+            "/v2/cudasharedmemory/region/xx/status",
         ]
 
-        csv_log = open('fuzz_results.csv', 'w')
+        csv_log = open("fuzz_results.csv", "w")
         logger = [FuzzLoggerCsv(file_handle=csv_log)]
 
         for url in url_list:
             self._run_fuzz(url, logger)
 
             # Get latest db file
-            files = glob.glob('boofuzz-results/*')
+            files = glob.glob("boofuzz-results/*")
             dbfile = max(files, key=os.path.getctime)
 
             conn = sqlite3.connect(dbfile)
@@ -90,10 +102,8 @@ def test_failures_from_db(self):
 
             # Get number of failures, should be 0
             self.assertEqual(
-                len([
-                    x for x in c.execute(
-                        "SELECT * FROM steps WHERE type=\"fail\"")
-                ]), 0)
+                len([x for x in c.execute('SELECT * FROM steps WHERE type="fail"')]), 0
+            )
 
 
 if __name__ == "__main__":
diff --git a/qa/L0_http_fuzz/test.sh b/qa/L0_http_fuzz/test.sh
old mode 100644
new mode 100755
index e56a6018e8..f721135698
--- a/qa/L0_http_fuzz/test.sh
+++ b/qa/L0_http_fuzz/test.sh
@@ -55,6 +55,20 @@ SERVER=/opt/tritonserver/bin/tritonserver
 SERVER_ARGS="--model-repository=$DATADIR"
 source ../common/util.sh
 
+# Remove this once the boofuzz and tornado packages are upgraded to work with Python 3.10.
+# This test exercises the server's ability to handle malformed input, not its
+# compatibility with Python 3.10, so Python 3.8 is fine to use here.
+function_install_python38() {
+    source ../L0_backend_python/common.sh
+    install_conda
+    create_conda_env "3.8" "python-3-8"
+
+    # Install test script dependencies
+    pip3 install --upgrade wheel setuptools boofuzz==0.3.0 numpy pillow attrdict future grpcio requests gsutil \
+                            awscli six grpcio-channelz prettytable virtualenv
+}
+function_install_python38
+
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
@@ -65,7 +79,7 @@ fi
 set +e
 
 # Test health
-python $FUZZTEST -v >> ${FUZZ_LOG} 2>&1
+python3 $FUZZTEST -v >> ${FUZZ_LOG} 2>&1
 if [ $? -ne 0 ]; then
     cat ${FUZZ_LOG}
     RET=1
diff --git a/qa/L0_https/test.sh b/qa/L0_https/test.sh
old mode 100644
new mode 100755
index 58473bf735..2c030332e5
--- a/qa/L0_https/test.sh
+++ b/qa/L0_https/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -42,6 +42,7 @@ export CUDA_VISIBLE_DEVICES=0
 
 RET=0
 
+SIMPLE_AIO_INFER_CLIENT_PY=../clients/simple_http_aio_infer_client.py
 SIMPLE_INFER_CLIENT_PY=../clients/simple_http_infer_client.py
 TEST_CLIENT=../clients/simple_http_infer_client
 
@@ -103,6 +104,11 @@ if [ $? -ne 0 ]; then
     cat ${CLIENT_LOG}.ssl_infer
     RET=1
 fi
+python $SIMPLE_AIO_INFER_CLIENT_PY -v -u localhost --ssl --key-file client.key --cert-file client.crt --ca-certs ca.crt >> ${CLIENT_LOG}.ssl_infer.aio 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.ssl_infer.aio
+    RET=1
+fi
 
 $TEST_CLIENT -v -u https://localhost:443 --key-file client.key --cert-file client.crt --ca-certs ca.crt >> ${CLIENT_LOG}.c++.ssl_infer 2>&1
 if [ $? -ne 0 ]; then
@@ -116,6 +122,11 @@ if [ $? -ne 0 ]; then
     cat ${CLIENT_LOG}.ssl_infer_insecure
     RET=1
 fi
+python $SIMPLE_AIO_INFER_CLIENT_PY -v -u localhost --ssl --insecure >> ${CLIENT_LOG}.ssl_infer_insecure.aio 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.ssl_infer_insecure.aio
+    RET=1
+fi
 
 $TEST_CLIENT -v -u https://localhost:443 --verify-host 0 --verify-peer 0 >> ${CLIENT_LOG}.c++.ssl_infer_insecure 2>&1
 if [ $? -ne 0 ]; then
@@ -124,7 +135,7 @@ if [ $? -ne 0 ]; then
 fi
 
 # Test failure cases for SSL
-# Try without SSL 
+# Try without SSL
 $SIMPLE_INFER_CLIENT_PY -v -u localhost >> ${CLIENT_LOG}.no_ssl_fail_infer 2>&1
 if [ $? -ne 0 ]; then
     cat ${CLIENT_LOG}.no_ssl_fail_infer
@@ -132,6 +143,13 @@ if [ $? -ne 0 ]; then
 else
     RET=1
 fi
+$SIMPLE_AIO_INFER_CLIENT_PY -v -u localhost >> ${CLIENT_LOG}.no_ssl_fail_infer.aio 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.no_ssl_fail_infer.aio
+    echo -e "\n***\n*** Expected test failure\n***"
+else
+    RET=1
+fi
 
 $TEST_CLIENT -v -u https://localhost:443 >> ${CLIENT_LOG}.c++.no_ssl_fail_infer 2>&1
 if [ $? -ne 0 ]; then
@@ -150,6 +168,13 @@ if [ $? -ne 0 ]; then
 else
     RET=1
 fi
+$SIMPLE_AIO_INFER_CLIENT_PY -v -u localhost --ssl --key-file client2.key --cert-file client.crt --ca-certs ca.crt >> ${CLIENT_LOG}.ssl_wrong_key.aio 2>&1
+if [ $? -ne 0 ]; then
+    cat ${CLIENT_LOG}.ssl_wrong_key.aio
+    echo -e "\n***\n*** Expected test failure\n***"
+else
+    RET=1
+fi
 
 $TEST_CLIENT -v -u https://localhost:443 --key-file client2.key --cert-file client.crt --ca-certs ca.crt >> ${CLIENT_LOG}.c++.ssl_wrong_key 2>&1
 if [ $? -ne 0 ]; then
diff --git a/qa/L0_implicit_state/implicit_state.py b/qa/L0_implicit_state/implicit_state.py
old mode 100644
new mode 100755
index 72d8bb1a37..2cdf7ff2e0
--- a/qa/L0_implicit_state/implicit_state.py
+++ b/qa/L0_implicit_state/implicit_state.py
@@ -1,5 +1,5 @@
 #!/usr/bin/env python
-# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -27,77 +27,185 @@
 
 import sys
 
-sys.path.append('../common')
+sys.path.append("../common")
 
-import argparse
-import numpy as np
 import os
+import unittest
 from builtins import range
+
+import numpy as np
+import test_util as tu
 import tritonclient.http as tritonhttpclient
 from tritonclient.utils import InferenceServerException
-import unittest
-import test_util as tu
-BACKENDS = os.environ.get('BACKENDS', "onnx plan")
 
+BACKENDS = os.environ.get("BACKENDS", "onnx plan libtorch")
 
-class ImplicitStateTest(tu.TestResultCollector):
 
+class ImplicitStateTest(tu.TestResultCollector):
     def test_no_implicit_state(self):
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
         inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs.append(tritonhttpclient.InferInput('TEST_CASE', [1], 'INT32'))
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
         inputs[0].set_data_from_numpy(np.random.randint(5, size=[1], dtype=np.int32))
         inputs[1].set_data_from_numpy(np.asarray([0], dtype=np.int32))
 
         with self.assertRaises(InferenceServerException) as e:
-            triton_client.infer(model_name="no_implicit_state", inputs=inputs, sequence_id=1, sequence_start=True)
+            triton_client.infer(
+                model_name="no_implicit_state",
+                inputs=inputs,
+                sequence_id=1,
+                sequence_start=True,
+            )
 
-        self.assertEqual(str(e.exception), "unable to add state 'undefined_state'. State configuration is missing for model 'no_implicit_state'.")
+        err_str = str(e.exception).lower()
+        self.assertIn("unable to add state 'undefined_state'", err_str)
+        self.assertIn(
+            "state configuration is missing for model 'no_implicit_state'", err_str
+        )
 
     def test_wrong_implicit_state_name(self):
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
         inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs.append(tritonhttpclient.InferInput('TEST_CASE', [1], 'INT32'))
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
         inputs[0].set_data_from_numpy(np.random.randint(5, size=[1], dtype=np.int32))
         inputs[1].set_data_from_numpy(np.asarray([0], dtype=np.int32))
 
         with self.assertRaises(InferenceServerException) as e:
-            triton_client.infer(model_name="wrong_internal_state", inputs=inputs, sequence_id=2, sequence_start=True)
+            triton_client.infer(
+                model_name="wrong_internal_state",
+                inputs=inputs,
+                sequence_id=2,
+                sequence_start=True,
+            )
+
+        err_str = str(e.exception).lower()
+        self.assertIn("state 'undefined_state' is not a valid state name", err_str)
 
-        self.assertEqual(str(e.exception), "state 'undefined_state' is not a valid state name.")
+    def test_implicit_state_single_buffer(self):
+        triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
+        inputs = []
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
+        inputs[0].set_data_from_numpy(np.random.randint(5, size=[1], dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.asarray([2], dtype=np.int32))
+
+        triton_client.infer(
+            model_name="single_state_buffer",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=True,
+            sequence_end=False,
+        )
+
+        triton_client.infer(
+            model_name="single_state_buffer",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=False,
+            sequence_end=True,
+        )
+
+    def test_implicit_state_growable_memory(self):
+        triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
+        inputs = []
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
+        inputs[0].set_data_from_numpy(np.random.randint(5, size=[1], dtype=np.int32))
+        inputs[1].set_data_from_numpy(np.asarray([3], dtype=np.int32))
+
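+        # Each step below grows the implicit state buffer; the returned
+        # OUTPUT_STATE is checked against the expected accumulated contents
+        # after every inference.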
+        output = triton_client.infer(
+            model_name="growable_memory",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=True,
+            sequence_end=False,
+        )
+        output_state = output.as_numpy("OUTPUT_STATE")
+        expected_output_state = np.zeros(output_state.shape, dtype=np.int8)
+        np.testing.assert_equal(output_state, expected_output_state)
+
+        output = triton_client.infer(
+            model_name="growable_memory",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=False,
+            sequence_end=False,
+        )
+        output_state = output.as_numpy("OUTPUT_STATE")
+        expected_output_state = np.concatenate(
+            [expected_output_state, np.ones(expected_output_state.shape, dtype=np.int8)]
+        )
+        np.testing.assert_equal(output_state, expected_output_state)
+
+        output = triton_client.infer(
+            model_name="growable_memory",
+            inputs=inputs,
+            sequence_id=2,
+            sequence_start=False,
+            sequence_end=False,
+        )
+        output_state = output.as_numpy("OUTPUT_STATE")
+        expected_output_state = np.concatenate(
+            [
+                expected_output_state,
+                np.full(
+                    (expected_output_state.shape[0] // 2,), dtype=np.int8, fill_value=2
+                ),
+            ]
+        )
+        np.testing.assert_equal(output_state, expected_output_state)
 
     def test_no_update(self):
-	    # Test implicit state without updating any state
+        # Test implicit state without updating any state
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
         inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs.append(tritonhttpclient.InferInput('TEST_CASE', [1], 'INT32'))
+        inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+        inputs.append(tritonhttpclient.InferInput("TEST_CASE", [1], "INT32"))
         inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
         inputs[1].set_data_from_numpy(np.asarray([1], dtype=np.int32))
         correlation_id = 3
 
         # Make sure the state is never updated.
-        result_start = triton_client.infer(model_name="no_state_update", inputs=inputs, sequence_id=correlation_id, sequence_start=True)
-        self.assertEqual(result_start.as_numpy('OUTPUT')[0], 1)
+        result_start = triton_client.infer(
+            model_name="no_state_update",
+            inputs=inputs,
+            sequence_id=correlation_id,
+            sequence_start=True,
+        )
+        self.assertEqual(result_start.as_numpy("OUTPUT")[0], 1)
         for _ in range(10):
-            result = triton_client.infer(model_name="no_state_update", inputs=inputs, sequence_id=correlation_id)
-            self.assertEqual(result.as_numpy('OUTPUT')[0], 1)
+            result = triton_client.infer(
+                model_name="no_state_update", inputs=inputs, sequence_id=correlation_id
+            )
+            self.assertEqual(result.as_numpy("OUTPUT")[0], 1)
 
-        result_start = triton_client.infer(model_name="no_state_update", inputs=inputs, sequence_id=correlation_id, sequence_end=True)
-        self.assertEqual(result.as_numpy('OUTPUT')[0], 1)
+        _ = triton_client.infer(
+            model_name="no_state_update",
+            inputs=inputs,
+            sequence_id=correlation_id,
+            sequence_end=True,
+        )
+        self.assertEqual(result.as_numpy("OUTPUT")[0], 1)
 
     def test_request_output_not_allowed(self):
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
-        inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
-
-        outputs = []
-        outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT_STATE'))
 
         for backend in BACKENDS.split(" "):
+            inputs = []
+            if backend.strip() == "libtorch":
+                inputs.append(tritonhttpclient.InferInput("INPUT__0", [1], "INT32"))
+            else:
+                inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+            inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
+
+            outputs = []
+            if backend.strip() == "libtorch":
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT_STATE__1"))
+            else:
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT_STATE"))
+
             with self.assertRaises(InferenceServerException) as e:
                 triton_client.infer(
                     model_name=f"{backend}_nobatch_sequence_int32",
@@ -105,31 +213,52 @@ def test_request_output_not_allowed(self):
                     outputs=outputs,
                     sequence_id=1,
                     sequence_start=True,
-                    sequence_end=True)
-            self.assertTrue(str(e.exception).startswith("unexpected inference output 'OUTPUT_STATE' for model"))
+                    sequence_end=True,
+                )
+            if backend.strip() == "libtorch":
+                self.assertIn(
+                    "unexpected inference output 'OUTPUT_STATE__1' for model",
+                    str(e.exception),
+                )
+            else:
+                self.assertIn(
+                    "unexpected inference output 'OUTPUT_STATE' for model",
+                    str(e.exception),
+                )
 
     def test_request_output(self):
         triton_client = tritonhttpclient.InferenceServerClient("localhost:8000")
-        inputs = []
-        inputs.append(tritonhttpclient.InferInput('INPUT', [1], 'INT32'))
-        inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
+        for backend in BACKENDS.split(" "):
+            inputs = []
+            if backend.strip() == "libtorch":
+                inputs.append(tritonhttpclient.InferInput("INPUT__0", [1], "INT32"))
+            else:
+                inputs.append(tritonhttpclient.InferInput("INPUT", [1], "INT32"))
+            inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.int32))
 
-        outputs = []
-        outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT_STATE'))
-        outputs.append(tritonhttpclient.InferRequestedOutput('OUTPUT'))
+            outputs = []
+            if backend.strip() == "libtorch":
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT_STATE__1"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT__0"))
+            else:
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT_STATE"))
+                outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT"))
 
-        for backend in BACKENDS.split(" "):
             result = triton_client.infer(
-                    model_name=f"{backend}_nobatch_sequence_int32_output",
-                    inputs=inputs,
-                    outputs=outputs,
-                    sequence_id=1,
-                    sequence_start=True,
-                    sequence_end=True)
-            self.assertTrue(result.as_numpy('OUTPUT_STATE')[0], 1)
-            self.assertTrue(result.as_numpy('OUTPUT')[0], 1)
+                model_name=f"{backend}_nobatch_sequence_int32_output",
+                inputs=inputs,
+                outputs=outputs,
+                sequence_id=1,
+                sequence_start=True,
+                sequence_end=True,
+            )
+            if backend.strip() == "libtorch":
+                self.assertEqual(result.as_numpy("OUTPUT_STATE__1")[0], 1)
+                self.assertEqual(result.as_numpy("OUTPUT__0")[0], 1)
+            else:
+                self.assertEqual(result.as_numpy("OUTPUT_STATE")[0], 1)
+                self.assertEqual(result.as_numpy("OUTPUT")[0], 1)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
-
diff --git a/qa/L0_implicit_state/models/growable_memory/config.pbtxt b/qa/L0_implicit_state/models/growable_memory/config.pbtxt
new file mode 100644
index 0000000000..0a7920bdf1
--- /dev/null
+++ b/qa/L0_implicit_state/models/growable_memory/config.pbtxt
@@ -0,0 +1,103 @@
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+name: "growable_memory"
+backend: "implicit_state"
+max_batch_size: 0
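+# This model exercises a growable, GPU-backed implicit state buffer
+# (use_growable_memory below); the L0_implicit_state test.sh launches the
+# server with --cuda-virtual-address-size alongside it.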
+sequence_batching {
+  control_input [
+    {
+      name: "START"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_START
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    },
+    {
+      name: "READY"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_READY
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    },
+    {
+      name: "END"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_END
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    }
+  ]
+  state [
+    {
+        input_name: "INPUT_STATE"
+        output_name: "OUTPUT_STATE"
+        data_type: TYPE_INT8
+        dims: [1024, 1024]
+        use_same_buffer_for_input_output: true
+        use_growable_memory: true
+    }
+  ]
+}
+
+input [
+  {
+    name: "INPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  },
+  {
+    name: "TEST_CASE"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+
+output [
+  {
+    name: "OUTPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  },
+  {
+    name: "OUTPUT_STATE"
+    data_type: TYPE_INT8
+    dims: [ 1 ]
+  }
+]
+
+instance_group [
+  {
+    count: 1
+    kind : KIND_GPU
+  }
+]
diff --git a/qa/L0_implicit_state/models/single_state_buffer/config.pbtxt b/qa/L0_implicit_state/models/single_state_buffer/config.pbtxt
new file mode 100644
index 0000000000..0f72d772a6
--- /dev/null
+++ b/qa/L0_implicit_state/models/single_state_buffer/config.pbtxt
@@ -0,0 +1,97 @@
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+name: "single_state_buffer"
+backend: "implicit_state"
+max_batch_size: 0
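+# The state below sets use_same_buffer_for_input_output, so the input and
+# output state share a single buffer across the sequence.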
+sequence_batching {
+  control_input [
+    {
+      name: "START"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_START
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    },
+    {
+      name: "READY"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_READY
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    },
+    {
+      name: "END"
+      control [
+        {
+          kind: CONTROL_SEQUENCE_END
+          fp32_false_true: [ 0, 1 ]
+        }
+      ]
+    }
+  ]
+  state [
+    {
+        input_name: "INPUT_STATE"
+        output_name: "OUTPUT_STATE"
+        data_type: TYPE_INT32
+        dims: 1
+        use_same_buffer_for_input_output: true
+    }
+  ]
+}
+
+input [
+  {
+    name: "INPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  },
+  {
+    name: "TEST_CASE"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+
+output [
+  {
+    name: "OUTPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+
+instance_group [
+  {
+    count: 1
+    kind : KIND_CPU
+  }
+]
diff --git a/qa/L0_implicit_state/test.sh b/qa/L0_implicit_state/test.sh
old mode 100644
new mode 100755
index 04436ec8d5..0722d29be1
--- a/qa/L0_implicit_state/test.sh
+++ b/qa/L0_implicit_state/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -41,14 +41,16 @@ DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"}
 TEST_RESULT_FILE='test_results.txt'
 
 export ENSEMBLES=0
-BACKENDS=${BACKENDS:="onnx plan"}
+BACKENDS=${BACKENDS:="libtorch onnx plan"}
 export BACKENDS
 export IMPLICIT_STATE=1
 INITIAL_STATE_ZERO=${INITIAL_STATE_ZERO:="0"}
 INITIAL_STATE_FILE=${INITIAL_STATE_FILE:="0"}
+SINGLE_STATE_BUFFER=${SINGLE_STATE_BUFFER:="0"}
 
 export INITIAL_STATE_ZERO
 export INITIAL_STATE_FILE
+export SINGLE_STATE_BUFFER
 
 MODELDIR=${MODELDIR:=`pwd`/models}
 TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"}
@@ -60,10 +62,14 @@ source ../common/util.sh
 cp ./libtriton_implicit_state.so models/no_implicit_state/
 cp ./libtriton_implicit_state.so models/no_state_update/
 cp ./libtriton_implicit_state.so models/wrong_internal_state/
+cp ./libtriton_implicit_state.so models/single_state_buffer/
+cp ./libtriton_implicit_state.so models/growable_memory/
 
 mkdir -p models/no_implicit_state/1/
 mkdir -p models/no_state_update/1/
 mkdir -p models/wrong_internal_state/1/
+mkdir -p models/single_state_buffer/1/
+mkdir -p models/growable_memory/1/
 
 for BACKEND in $BACKENDS; do
     dtype="int32"
@@ -78,15 +84,21 @@ for BACKEND in $BACKENDS; do
     rm -rf models/$model_name_allow_output
     cp -r $DATADIR/qa_sequence_implicit_model_repository/$model_name models/$model_name_allow_output
 
-    (cd models/$model_name_allow_output && \
-        sed -i "s/^name:.*/name: \"$model_name_allow_output\"/" config.pbtxt && \
-        echo -e "output [{ name: \"OUTPUT_STATE\" \n data_type: TYPE_INT32 \n dims: [ 1 ] }]" >> config.pbtxt)
+    if [ $BACKEND == "libtorch" ]; then
+        (cd models/$model_name_allow_output && \
+            sed -i "s/^name:.*/name: \"$model_name_allow_output\"/" config.pbtxt && \
+            echo -e "output [{ name: \"OUTPUT_STATE__1\" \n data_type: TYPE_INT32 \n dims: [ 1 ] }]" >> config.pbtxt)
+    else
+        (cd models/$model_name_allow_output && \
+            sed -i "s/^name:.*/name: \"$model_name_allow_output\"/" config.pbtxt && \
+            echo -e "output [{ name: \"OUTPUT_STATE\" \n data_type: TYPE_INT32 \n dims: [ 1 ] }]" >> config.pbtxt)
+    fi
 done
 
 CLIENT_LOG=`pwd`/client.log
-SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR}"
+SERVER_ARGS="--backend-directory=${BACKEND_DIR} --model-repository=${MODELDIR} --cuda-virtual-address-size=0:$((1024*1024*4))"
 IMPLICIT_STATE_CLIENT='implicit_state.py'
-EXPECTED_TEST_NUM=5
+EXPECTED_TEST_NUM=7
 rm -rf $CLIENT_LOG
 
 run_server
diff --git a/qa/L0_infer/infer_test.py b/qa/L0_infer/infer_test.py
old mode 100644
new mode 100755
index cb684c74db..3d5e116b4b
--- a/qa/L0_infer/infer_test.py
+++ b/qa/L0_infer/infer_test.py
@@ -1,4 +1,6 @@
-# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,64 +27,105 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
+import os
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
-import os
-
 from tritonclient.utils import *
 
-TEST_SYSTEM_SHARED_MEMORY = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY',
-                                                  0)))
-CPU_ONLY = (os.environ.get('TRITON_SERVER_CPU_ONLY') is not None)
+TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
+CPU_ONLY = os.environ.get("TRITON_SERVER_CPU_ONLY") is not None
+TEST_VALGRIND = bool(int(os.environ.get("TEST_VALGRIND", 0)))
 
-USE_GRPC = (os.environ.get('USE_GRPC', 1) != "0")
-USE_HTTP = (os.environ.get('USE_HTTP', 1) != "0")
+USE_GRPC = os.environ.get("USE_GRPC", 1) != "0"
+USE_HTTP = os.environ.get("USE_HTTP", 1) != "0"
 assert USE_GRPC or USE_HTTP, "USE_GRPC or USE_HTTP must be non-zero"
 
 BACKENDS = os.environ.get(
-    'BACKENDS',
-    "graphdef savedmodel onnx libtorch plan python python_dlpack openvino")
-ENSEMBLES = bool(int(os.environ.get('ENSEMBLES', 1)))
+    "BACKENDS", "graphdef savedmodel onnx libtorch plan python python_dlpack openvino"
+)
+ENSEMBLES = bool(int(os.environ.get("ENSEMBLES", 1)))
+NOBATCH = bool(int(os.environ.get("NOBATCH", 1)))
+BATCH = bool(int(os.environ.get("BATCH", 1)))
 
 np_dtype_string = np.dtype(object)
 
+# 60 sec is the default value
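+# Valgrind runs are much slower, so allow a longer client network timeout.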
+NETWORK_TIMEOUT = 300.0 if TEST_VALGRIND else 60.0
 
-class InferTest(tu.TestResultCollector):
 
-    def _full_exact(self, input_dtype, output0_dtype, output1_dtype,
-                    output0_raw, output1_raw, swap):
-
-        def _infer_exact_helper(tester,
-                                pf,
-                                tensor_shape,
-                                batch_size,
-                                input_dtype,
-                                output0_dtype,
-                                output1_dtype,
-                                output0_raw=True,
-                                output1_raw=True,
-                                model_version=None,
-                                swap=False,
-                                outputs=("OUTPUT0", "OUTPUT1"),
-                                use_http=USE_HTTP,
-                                use_grpc=USE_GRPC,
-                                use_http_json_tensors=True,
-                                skip_request_id_check=True,
-                                use_streaming=True,
-                                correlation_id=0):
+class InferTest(tu.TestResultCollector):
+    def _full_exact(
+        self,
+        input_dtype,
+        output0_dtype,
+        output1_dtype,
+        output0_raw,
+        output1_raw,
+        swap,
+        network_timeout=NETWORK_TIMEOUT,
+    ):
+        def _infer_exact_helper(
+            tester,
+            pf,
+            tensor_shape,
+            batch_size,
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            output0_raw=True,
+            output1_raw=True,
+            model_version=None,
+            swap=False,
+            outputs=("OUTPUT0", "OUTPUT1"),
+            use_http=USE_HTTP,
+            use_grpc=USE_GRPC,
+            use_http_json_tensors=True,
+            skip_request_id_check=True,
+            use_streaming=True,
+            correlation_id=0,
+            network_timeout=NETWORK_TIMEOUT,
+        ):
             for bs in (1, batch_size):
                 # model that does not support batching
-                if bs == 1:
+                if NOBATCH:
+                    if bs == 1:
+                        iu.infer_exact(
+                            tester,
+                            pf + "_nobatch",
+                            tensor_shape,
+                            bs,
+                            input_dtype,
+                            output0_dtype,
+                            output1_dtype,
+                            output0_raw=output0_raw,
+                            output1_raw=output1_raw,
+                            model_version=model_version,
+                            swap=swap,
+                            outputs=outputs,
+                            use_http=use_http,
+                            use_grpc=use_grpc,
+                            use_http_json_tensors=use_http_json_tensors,
+                            skip_request_id_check=skip_request_id_check,
+                            use_streaming=use_streaming,
+                            correlation_id=correlation_id,
+                            use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                            network_timeout=network_timeout,
+                        )
+
+                if BATCH:
+                    # model that supports batching.
                     iu.infer_exact(
                         tester,
-                        pf + "_nobatch",
-                        tensor_shape,
+                        pf,
+                        (bs,) + tensor_shape,
                         bs,
                         input_dtype,
                         output0_dtype,
@@ -99,29 +142,9 @@ def _infer_exact_helper(tester,
                         use_streaming=use_streaming,
                         correlation_id=correlation_id,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-
-                # model that supports batching.
-                iu.infer_exact(
-                    tester,
-                    pf, (bs,) + tensor_shape,
-                    bs,
-                    input_dtype,
-                    output0_dtype,
-                    output1_dtype,
-                    output0_raw=output0_raw,
-                    output1_raw=output1_raw,
-                    model_version=model_version,
-                    swap=swap,
-                    outputs=outputs,
-                    use_http=use_http,
-                    use_grpc=use_grpc,
-                    use_http_json_tensors=use_http_json_tensors,
-                    skip_request_id_check=skip_request_id_check,
-                    use_streaming=use_streaming,
-                    correlation_id=correlation_id,
-                    use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        network_timeout=network_timeout,
+                    )
 
         input_size = 16
 
@@ -129,88 +152,131 @@ def _infer_exact_helper(tester,
         ensemble_prefix = [""]
         if ENSEMBLES:
             for prefix in all_ensemble_prefix:
-                if tu.validate_for_ensemble_model(prefix, input_dtype,
-                                                  output0_dtype, output1_dtype,
-                                                  (input_size,), (input_size,),
-                                                  (input_size,)):
+                if tu.validate_for_ensemble_model(
+                    prefix,
+                    input_dtype,
+                    output0_dtype,
+                    output1_dtype,
+                    (input_size,),
+                    (input_size,),
+                    (input_size,),
+                ):
                     ensemble_prefix.append(prefix)
 
-        if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype,
-                                    (input_size,), (input_size,),
-                                    (input_size,)):
+        if tu.validate_for_tf_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            (input_size,),
+            (input_size,),
+            (input_size,),
+        ):
             for prefix in ensemble_prefix:
                 for pf in ["graphdef", "savedmodel"]:
                     if pf in BACKENDS:
-                        _infer_exact_helper(self,
-                                            prefix + pf, (input_size,),
-                                            8,
-                                            input_dtype,
-                                            output0_dtype,
-                                            output1_dtype,
-                                            output0_raw=output0_raw,
-                                            output1_raw=output1_raw,
-                                            swap=swap)
+                        _infer_exact_helper(
+                            self,
+                            prefix + pf,
+                            (input_size,),
+                            8,
+                            input_dtype,
+                            output0_dtype,
+                            output1_dtype,
+                            output0_raw=output0_raw,
+                            output1_raw=output1_raw,
+                            swap=swap,
+                            network_timeout=network_timeout,
+                        )
 
         if not CPU_ONLY and tu.validate_for_trt_model(
-                input_dtype, output0_dtype, output1_dtype, (input_size, 1, 1),
-            (input_size, 1, 1), (input_size, 1, 1)):
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            (input_size, 1, 1),
+            (input_size, 1, 1),
+            (input_size, 1, 1),
+        ):
             for prefix in ensemble_prefix:
-                if 'plan' in BACKENDS:
+                if "plan" in BACKENDS:
                     if input_dtype == np.int8:
-                        _infer_exact_helper(self,
-                                            prefix + 'plan', (input_size, 1, 1),
-                                            8,
-                                            input_dtype,
-                                            output0_dtype,
-                                            output1_dtype,
-                                            output0_raw=output0_raw,
-                                            output1_raw=output1_raw,
-                                            swap=swap)
+                        _infer_exact_helper(
+                            self,
+                            prefix + "plan",
+                            (input_size, 1, 1),
+                            8,
+                            input_dtype,
+                            output0_dtype,
+                            output1_dtype,
+                            output0_raw=output0_raw,
+                            output1_raw=output1_raw,
+                            swap=swap,
+                        )
                     else:
-                        _infer_exact_helper(self,
-                                            prefix + 'plan', (input_size,),
-                                            8,
-                                            input_dtype,
-                                            output0_dtype,
-                                            output1_dtype,
-                                            output0_raw=output0_raw,
-                                            output1_raw=output1_raw,
-                                            swap=swap)
-
-        if tu.validate_for_onnx_model(input_dtype, output0_dtype, output1_dtype,
-                                      (input_size,), (input_size,),
-                                      (input_size,)):
+                        _infer_exact_helper(
+                            self,
+                            prefix + "plan",
+                            (input_size,),
+                            8,
+                            input_dtype,
+                            output0_dtype,
+                            output1_dtype,
+                            output0_raw=output0_raw,
+                            output1_raw=output1_raw,
+                            swap=swap,
+                        )
+
+        if tu.validate_for_onnx_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            (input_size,),
+            (input_size,),
+            (input_size,),
+        ):
             for prefix in ensemble_prefix:
-                if 'onnx' in BACKENDS:
-                    _infer_exact_helper(self,
-                                        prefix + 'onnx', (input_size,),
-                                        8,
-                                        input_dtype,
-                                        output0_dtype,
-                                        output1_dtype,
-                                        output0_raw=output0_raw,
-                                        output1_raw=output1_raw,
-                                        swap=swap)
-
-        if tu.validate_for_libtorch_model(input_dtype, output0_dtype,
-                                          output1_dtype, (input_size,),
-                                          (input_size,), (input_size,)):
+                if "onnx" in BACKENDS:
+                    _infer_exact_helper(
+                        self,
+                        prefix + "onnx",
+                        (input_size,),
+                        8,
+                        input_dtype,
+                        output0_dtype,
+                        output1_dtype,
+                        output0_raw=output0_raw,
+                        output1_raw=output1_raw,
+                        swap=swap,
+                    )
+
+        if tu.validate_for_libtorch_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            (input_size,),
+            (input_size,),
+            (input_size,),
+        ):
             # Due to PyTorch bug
             # https://github.com/pytorch/pytorch/issues/66930 we can't
             # run this test with int8 input and int32 outputs.
-            if ((input_dtype == np.int8) and (output0_dtype == np.int32) and
-                (output1_dtype == np.int32)):
-                print('skipping pytorch test for int8_int32_int32')
+            if (
+                (input_dtype == np.int8)
+                and (output0_dtype == np.int32)
+                and (output1_dtype == np.int32)
+            ):
+                print("skipping pytorch test for int8_int32_int32")
             else:
                 for prefix in ensemble_prefix:
-                    if 'libtorch' in BACKENDS:
+                    if "libtorch" in BACKENDS:
                         # Skip batching for PyTorch String I/O
-                        if ((input_dtype == np_dtype_string) or
-                            (output0_dtype == np_dtype_string) or
-                            (output1_dtype == np_dtype_string)):
+                        if (
+                            (input_dtype == np_dtype_string)
+                            or (output0_dtype == np_dtype_string)
+                            or (output1_dtype == np_dtype_string)
+                        ):
                             iu.infer_exact(
                                 self,
-                                prefix + 'libtorch_nobatch',
+                                prefix + "libtorch_nobatch",
                                 (input_size,),
                                 1,  # batch_size
                                 input_dtype,
@@ -221,304 +287,382 @@ def _infer_exact_helper(tester,
                                 swap=swap,
                                 use_http=USE_HTTP,
                                 use_grpc=USE_GRPC,
-                                use_system_shared_memory=
-                                TEST_SYSTEM_SHARED_MEMORY,
-                                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                            )
                         else:
-                            _infer_exact_helper(self,
-                                                prefix + 'libtorch',
-                                                (input_size,),
-                                                8,
-                                                input_dtype,
-                                                output0_dtype,
-                                                output1_dtype,
-                                                output0_raw=output0_raw,
-                                                output1_raw=output1_raw,
-                                                swap=swap)
+                            _infer_exact_helper(
+                                self,
+                                prefix + "libtorch",
+                                (input_size,),
+                                8,
+                                input_dtype,
+                                output0_dtype,
+                                output1_dtype,
+                                output0_raw=output0_raw,
+                                output1_raw=output1_raw,
+                                swap=swap,
+                            )
 
         for prefix in ensemble_prefix:
             if prefix != "":
                 continue
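+            # Skip the Python backend models below when any I/O datatype is uint8.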
+            if (
+                input_dtype == np.uint8
+                or output0_dtype == np.uint8
+                or output1_dtype == np.uint8
+            ):
+                continue
+
+            if "python_dlpack" in BACKENDS:
+                _infer_exact_helper(
+                    self,
+                    prefix + "python_dlpack",
+                    (input_size,),
+                    8,
+                    input_dtype,
+                    output0_dtype,
+                    output1_dtype,
+                    output0_raw=output0_raw,
+                    output1_raw=output1_raw,
+                    swap=swap,
+                )
+            elif "python" in BACKENDS:
+                _infer_exact_helper(
+                    self,
+                    prefix + "python",
+                    (input_size,),
+                    8,
+                    input_dtype,
+                    output0_dtype,
+                    output1_dtype,
+                    output0_raw=output0_raw,
+                    output1_raw=output1_raw,
+                    swap=swap,
+                )
 
-            if 'python_dlpack' in BACKENDS:
-                _infer_exact_helper(self,
-                                    prefix + 'python_dlpack', (input_size,),
-                                    8,
-                                    input_dtype,
-                                    output0_dtype,
-                                    output1_dtype,
-                                    output0_raw=output0_raw,
-                                    output1_raw=output1_raw,
-                                    swap=swap)
-            elif 'python' in BACKENDS:
-                _infer_exact_helper(self,
-                                    prefix + 'python', (input_size,),
-                                    8,
-                                    input_dtype,
-                                    output0_dtype,
-                                    output1_dtype,
-                                    output0_raw=output0_raw,
-                                    output1_raw=output1_raw,
-                                    swap=swap)
+    def test_raw_uuu(self):
+        self._full_exact(
+            np.uint8, np.uint8, np.uint8, output0_raw=True, output1_raw=True, swap=True
+        )
 
     def test_raw_bbb(self):
-        self._full_exact(np.int8,
-                         np.int8,
-                         np.int8,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=True)
+        self._full_exact(
+            np.int8, np.int8, np.int8, output0_raw=True, output1_raw=True, swap=True
+        )
 
     def test_raw_sss(self):
-        self._full_exact(np.int16,
-                         np.int16,
-                         np.int16,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=True)
+        self._full_exact(
+            np.int16, np.int16, np.int16, output0_raw=True, output1_raw=True, swap=True
+        )
 
     def test_raw_iii(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=True)
+        self._full_exact(
+            np.int32, np.int32, np.int32, output0_raw=True, output1_raw=True, swap=True
+        )
 
     def test_raw_lll(self):
-        self._full_exact(np.int64,
-                         np.int64,
-                         np.int64,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int64, np.int64, np.int64, output0_raw=True, output1_raw=True, swap=False
+        )
 
     def test_raw_hhh(self):
-        self._full_exact(np.float16,
-                         np.float16,
-                         np.float16,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.float16,
+            np.float16,
+            np.float16,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_fff(self):
-        self._full_exact(np.float32,
-                         np.float32,
-                         np.float32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=True)
+        self._full_exact(
+            np.float32,
+            np.float32,
+            np.float32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=True,
+        )
 
     def test_raw_hff(self):
-        self._full_exact(np.float16,
-                         np.float32,
-                         np.float32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.float16,
+            np.float32,
+            np.float32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_bii(self):
-        self._full_exact(np.int8,
-                         np.int32,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int8, np.int32, np.int32, output0_raw=True, output1_raw=True, swap=False
+        )
 
     def test_raw_ibb(self):
-        self._full_exact(np.int32,
-                         np.int8,
-                         np.int8,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32, np.int8, np.int8, output0_raw=True, output1_raw=True, swap=False
+        )
 
     def test_raw_ibs(self):
-        self._full_exact(np.int32,
-                         np.int8,
-                         np.int16,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32, np.int8, np.int16, output0_raw=True, output1_raw=True, swap=False
+        )
+
+    def test_raw_fuu(self):
+        self._full_exact(
+            np.float32,
+            np.uint8,
+            np.uint8,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
+
+    def test_raw_uff(self):
+        self._full_exact(
+            np.uint8,
+            np.float32,
+            np.float32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
+
+    def test_raw_fuh(self):
+        self._full_exact(
+            np.float32,
+            np.uint8,
+            np.float16,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_iff(self):
-        self._full_exact(np.int32,
-                         np.float32,
-                         np.float32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np.float32,
+            np.float32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_fii(self):
-        self._full_exact(np.float32,
-                         np.int32,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.float32,
+            np.int32,
+            np.int32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ihs(self):
-        self._full_exact(np.int32,
-                         np.float16,
-                         np.int16,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np.float16,
+            np.int16,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ooo(self):
-        self._full_exact(np_dtype_string,
-                         np_dtype_string,
-                         np_dtype_string,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np_dtype_string,
+            np_dtype_string,
+            np_dtype_string,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_oii(self):
-        self._full_exact(np_dtype_string,
-                         np.int32,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np_dtype_string,
+            np.int32,
+            np.int32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_oio(self):
-        self._full_exact(np_dtype_string,
-                         np.int32,
-                         np_dtype_string,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np_dtype_string,
+            np.int32,
+            np_dtype_string,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ooi(self):
-        self._full_exact(np_dtype_string,
-                         np_dtype_string,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np_dtype_string,
+            np_dtype_string,
+            np.int32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ioo(self):
-        self._full_exact(np.int32,
-                         np_dtype_string,
-                         np_dtype_string,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np_dtype_string,
+            np_dtype_string,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_iio(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np_dtype_string,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np_dtype_string,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     def test_raw_ioi(self):
-        self._full_exact(np.int32,
-                         np_dtype_string,
-                         np.int32,
-                         output0_raw=True,
-                         output1_raw=True,
-                         swap=False)
+        self._full_exact(
+            np.int32,
+            np_dtype_string,
+            np.int32,
+            output0_raw=True,
+            output1_raw=True,
+            swap=False,
+        )
 
     # shared memory does not support class output
     if not (TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY):
 
         def test_class_bbb(self):
-            self._full_exact(np.int8,
-                             np.int8,
-                             np.int8,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int8,
+                np.int8,
+                np.int8,
+                output0_raw=False,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_class_sss(self):
-            self._full_exact(np.int16,
-                             np.int16,
-                             np.int16,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int16,
+                np.int16,
+                np.int16,
+                output0_raw=False,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_class_iii(self):
-            self._full_exact(np.int32,
-                             np.int32,
-                             np.int32,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int32,
+                np.int32,
+                np.int32,
+                output0_raw=False,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_class_lll(self):
-            self._full_exact(np.int64,
-                             np.int64,
-                             np.int64,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=False)
+            self._full_exact(
+                np.int64,
+                np.int64,
+                np.int64,
+                output0_raw=False,
+                output1_raw=False,
+                swap=False,
+            )
 
         def test_class_fff(self):
-            self._full_exact(np.float32,
-                             np.float32,
-                             np.float32,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.float32,
+                np.float32,
+                np.float32,
+                output0_raw=False,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_class_iff(self):
-            self._full_exact(np.int32,
-                             np.float32,
-                             np.float32,
-                             output0_raw=False,
-                             output1_raw=False,
-                             swap=False)
+            self._full_exact(
+                np.int32,
+                np.float32,
+                np.float32,
+                output0_raw=False,
+                output1_raw=False,
+                swap=False,
+            )
 
         def test_mix_bbb(self):
-            self._full_exact(np.int8,
-                             np.int8,
-                             np.int8,
-                             output0_raw=True,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int8,
+                np.int8,
+                np.int8,
+                output0_raw=True,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_mix_sss(self):
-            self._full_exact(np.int16,
-                             np.int16,
-                             np.int16,
-                             output0_raw=False,
-                             output1_raw=True,
-                             swap=True)
+            self._full_exact(
+                np.int16,
+                np.int16,
+                np.int16,
+                output0_raw=False,
+                output1_raw=True,
+                swap=True,
+            )
 
         def test_mix_iii(self):
-            self._full_exact(np.int32,
-                             np.int32,
-                             np.int32,
-                             output0_raw=True,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.int32,
+                np.int32,
+                np.int32,
+                output0_raw=True,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_mix_lll(self):
-            self._full_exact(np.int64,
-                             np.int64,
-                             np.int64,
-                             output0_raw=False,
-                             output1_raw=True,
-                             swap=False)
+            self._full_exact(
+                np.int64,
+                np.int64,
+                np.int64,
+                output0_raw=False,
+                output1_raw=True,
+                swap=False,
+            )
 
         def test_mix_fff(self):
-            self._full_exact(np.float32,
-                             np.float32,
-                             np.float32,
-                             output0_raw=True,
-                             output1_raw=False,
-                             swap=True)
+            self._full_exact(
+                np.float32,
+                np.float32,
+                np.float32,
+                output0_raw=True,
+                output1_raw=False,
+                swap=True,
+            )
 
         def test_mix_iff(self):
-            self._full_exact(np.int32,
-                             np.float32,
-                             np.float32,
-                             output0_raw=False,
-                             output1_raw=True,
-                             swap=False)
+            self._full_exact(
+                np.int32,
+                np.float32,
+                np.float32,
+                output0_raw=False,
+                output1_raw=True,
+                swap=False,
+            )
 
     def test_raw_version_latest_1(self):
         input_size = 16
@@ -526,7 +670,7 @@ def test_raw_version_latest_1(self):
 
         # There are 3 versions of graphdef_int8_int8_int8 but
         # only version 3 should be available
-        for platform in ('graphdef', 'savedmodel'):
+        for platform in ("graphdef", "savedmodel"):
             if platform not in BACKENDS:
                 continue
             try:
@@ -543,10 +687,10 @@ def test_raw_version_latest_1(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
 
             try:
                 iu.infer_exact(
@@ -562,24 +706,26 @@ def test_raw_version_latest_1(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
-
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int8,
-                           np.int8,
-                           np.int8,
-                           model_version=3,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
+
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int8,
+                np.int8,
+                np.int8,
+                model_version=3,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
     def test_raw_version_latest_2(self):
         input_size = 16
@@ -587,7 +733,7 @@ def test_raw_version_latest_2(self):
 
         # There are 3 versions of graphdef_int16_int16_int16 but only
         # versions 2 and 3 should be available
-        for platform in ('graphdef', 'savedmodel'):
+        for platform in ("graphdef", "savedmodel"):
             if platform not in BACKENDS:
                 continue
             try:
@@ -604,37 +750,41 @@ def test_raw_version_latest_2(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
-
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int16,
-                           np.int16,
-                           np.int16,
-                           model_version=2,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int16,
-                           np.int16,
-                           np.int16,
-                           model_version=3,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
+
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int16,
+                np.int16,
+                np.int16,
+                model_version=2,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int16,
+                np.int16,
+                np.int16,
+                model_version=3,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
     def test_raw_version_all(self):
         input_size = 16
@@ -642,48 +792,54 @@ def test_raw_version_all(self):
 
         # There are 3 versions of *_int32_int32_int32 and all should
         # be available.
-        for platform in ('graphdef', 'savedmodel'):
+        for platform in ("graphdef", "savedmodel"):
             if platform not in BACKENDS:
                 continue
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           model_version=1,
-                           swap=False,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           model_version=2,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           model_version=3,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                model_version=1,
+                swap=False,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                model_version=2,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                model_version=3,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
     def test_raw_version_specific_1(self):
         input_size = 16
@@ -691,22 +847,24 @@ def test_raw_version_specific_1(self):
 
         # There are 3 versions of *_float16_float16_float16 but only
         # version 1 should be available.
-        for platform in ('graphdef', 'savedmodel'):
+        for platform in ("graphdef", "savedmodel"):
             if platform not in BACKENDS:
                 continue
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.float16,
-                           np.float16,
-                           np.float16,
-                           model_version=1,
-                           swap=False,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.float16,
+                np.float16,
+                np.float16,
+                model_version=1,
+                swap=False,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
             try:
                 iu.infer_exact(
@@ -722,10 +880,10 @@ def test_raw_version_specific_1(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
 
             try:
                 iu.infer_exact(
@@ -741,35 +899,37 @@ def test_raw_version_specific_1(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
 
     def test_raw_version_specific_1_3(self):
         input_size = 16
 
         # There are 3 versions of *_float32_float32_float32 but only
         # versions 1 and 3 should be available.
-        for platform in ('graphdef', 'savedmodel', 'plan'):
-            if platform == 'plan' and CPU_ONLY:
+        for platform in ("graphdef", "savedmodel", "plan"):
+            if platform == "plan" and CPU_ONLY:
                 continue
             if platform not in BACKENDS:
                 continue
             tensor_shape = (1, input_size)
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           model_version=1,
-                           swap=False,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                model_version=1,
+                swap=False,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
             try:
                 iu.infer_exact(
@@ -785,27 +945,29 @@ def test_raw_version_specific_1_3(self):
                     use_http=USE_HTTP,
                     use_grpc=USE_GRPC,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             except InferenceServerException as ex:
-                self.assertTrue(
-                    ex.message().startswith("Request for unknown model"))
-
-            iu.infer_exact(self,
-                           platform,
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           model_version=3,
-                           swap=True,
-                           use_http=USE_HTTP,
-                           use_grpc=USE_GRPC,
-                           use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                           use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                self.assertTrue(ex.message().startswith("Request for unknown model"))
+
+            iu.infer_exact(
+                self,
+                platform,
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                model_version=3,
+                swap=True,
+                use_http=USE_HTTP,
+                use_grpc=USE_GRPC,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
     if ENSEMBLES:
-        if all(x in BACKENDS for x in ['graphdef', 'savedmodel']):
+        if all(x in BACKENDS for x in ["graphdef", "savedmodel"]):
 
             def test_ensemble_mix_platform(self):
                 # Skip on CPU only machine as TensorRT model is used in this ensemble
@@ -814,7 +976,8 @@ def test_ensemble_mix_platform(self):
                 for bs in (1, 8):
                     iu.infer_exact(
                         self,
-                        "mix_platform", (bs, 16),
+                        "mix_platform",
+                        (bs, 16),
                         bs,
                         np.float32,
                         np.float32,
@@ -822,7 +985,8 @@ def test_ensemble_mix_platform(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
         if "graphdef" in BACKENDS:
 
@@ -830,7 +994,8 @@ def test_ensemble_mix_type(self):
                 for bs in (1, 8):
                     iu.infer_exact(
                         self,
-                        "mix_type", (bs, 16),
+                        "mix_type",
+                        (bs, 16),
                         bs,
                         np.int32,
                         np.float32,
@@ -838,15 +1003,17 @@ def test_ensemble_mix_type(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
-        if all(x in BACKENDS for x in ['graphdef', 'savedmodel']):
+        if all(x in BACKENDS for x in ["graphdef", "savedmodel"]):
 
             def test_ensemble_mix_ensemble(self):
                 for bs in (1, 8):
                     iu.infer_exact(
                         self,
-                        "mix_ensemble", (bs, 16),
+                        "mix_ensemble",
+                        (bs, 16),
                         bs,
                         np.int32,
                         np.float32,
@@ -854,11 +1021,15 @@ def test_ensemble_mix_ensemble(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
-        if all(x in BACKENDS for x in [
-                'graphdef',
-        ]):
+        if "graphdef" in BACKENDS:
 
             def test_ensemble_mix_batch_nobatch(self):
                 base_names = ["batch_to_nobatch", "nobatch_to_batch"]
@@ -866,7 +1037,8 @@ def test_ensemble_mix_batch_nobatch(self):
                     for bs in (1, 8):
                         iu.infer_exact(
                             self,
-                            name, (bs, 16),
+                            name,
+                            (bs, 16),
                             bs,
                             np.float32,
                             np.float32,
@@ -874,10 +1046,12 @@ def test_ensemble_mix_batch_nobatch(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
                     iu.infer_exact(
                         self,
-                        name + "_nobatch", (8, 16),
+                        name + "_nobatch",
+                        (8, 16),
                         1,
                         np.float32,
                         np.float32,
@@ -885,13 +1059,15 @@ def test_ensemble_mix_batch_nobatch(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
                 # batch -> nobatch -> batch
                 for bs in (1, 8):
                     iu.infer_exact(
                         self,
-                        "mix_nobatch_batch", (bs, 16),
+                        "mix_nobatch_batch",
+                        (bs, 16),
                         bs,
                         np.float32,
                         np.float32,
@@ -899,17 +1075,19 @@ def test_ensemble_mix_batch_nobatch(self):
                         use_http=USE_HTTP,
                         use_grpc=USE_GRPC,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
         if not (TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY):
 
             def test_ensemble_label_lookup(self):
-                if all(x in BACKENDS for x in ['graphdef', 'savedmodel']):
+                if all(x in BACKENDS for x in ["graphdef", "savedmodel"]):
                     # Ensemble needs to look up label from the actual model
                     for bs in (1, 8):
                         iu.infer_exact(
                             self,
-                            "mix_platform", (bs, 16),
+                            "mix_platform",
+                            (bs, 16),
                             bs,
                             np.float32,
                             np.float32,
@@ -919,14 +1097,16 @@ def test_ensemble_label_lookup(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
 
-                if all(x in BACKENDS for x in ['graphdef', 'savedmodel']):
+                if all(x in BACKENDS for x in ["graphdef", "savedmodel"]):
                     # Label from the actual model will be passed along the nested ensemble
                     for bs in (1, 8):
                         iu.infer_exact(
                             self,
-                            "mix_ensemble", (bs, 16),
+                            "mix_ensemble",
+                            (bs, 16),
                             bs,
                             np.int32,
                             np.float32,
@@ -936,14 +1116,16 @@ def test_ensemble_label_lookup(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
 
                 if "graphdef" in BACKENDS:
                     # If label file is provided, it will use the provided label file directly
                     try:
                         iu.infer_exact(
                             self,
-                            "wrong_label", (1, 16),
+                            "wrong_label",
+                            (1, 16),
                             1,
                             np.int32,
                             np.float32,
@@ -953,7 +1135,8 @@ def test_ensemble_label_lookup(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
                     except AssertionError:
                         # Sanity check that infer_exact failed since this ensemble is provided
                         # with unexpected labels
@@ -963,7 +1146,8 @@ def test_ensemble_label_lookup(self):
                     for bs in (1, 8):
                         iu.infer_exact(
                             self,
-                            "label_override", (bs, 16),
+                            "label_override",
+                            (bs, 16),
                             bs,
                             np.int32,
                             np.float32,
@@ -973,8 +1157,9 @@ def test_ensemble_label_lookup(self):
                             use_http=USE_HTTP,
                             use_grpc=USE_GRPC,
                             use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                            use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                        )
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
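
As a usage note (not part of the patch itself): the rewritten InferTest above is gated by the NOBATCH and BATCH toggles that test.sh exports below, so the no-batch and batched model variants can be exercised independently. A minimal sketch of running a single dtype combination directly, assuming infer_test.py picks these settings up from the environment as test.sh's exports suggest, and that a tritonserver instance and the QA model repository are already set up the way test.sh normally arranges them (backend list and test name here are illustrative only):

    # Illustrative direct invocation; all values are examples, not CI defaults.
    cd qa/L0_infer
    export BACKENDS="onnx libtorch"   # subset of the backends test.sh assembles
    export ENSEMBLES=0                # skip ensemble prefixes
    export NOBATCH=0                  # skip the *_nobatch models
    export BATCH=1                    # keep the batched models
    python3 infer_test.py InferTest.test_raw_fff
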
diff --git a/qa/L0_infer/install_and_test.sh b/qa/L0_infer/install_and_test.sh
index f488f510f4..28e5dad52e 100755
--- a/qa/L0_infer/install_and_test.sh
+++ b/qa/L0_infer/install_and_test.sh
@@ -25,7 +25,7 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-# Note: This script is to be used with customized triton containers that need 
+# Note: This script is to be used with customized triton containers that need
 # dependencies to run L0_infer tests
 apt-get update && \
     apt-get install -y --no-install-recommends \
diff --git a/qa/L0_infer/test.sh b/qa/L0_infer/test.sh
index c48d4f8f64..34a669f874 100755
--- a/qa/L0_infer/test.sh
+++ b/qa/L0_infer/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -38,12 +38,14 @@ if [ ! -z "$TEST_REPO_ARCH" ]; then
     REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
 fi
 
+ldconfig || true
+
 export CUDA_VISIBLE_DEVICES=0
 
 TEST_RESULT_FILE='test_results.txt'
 CLIENT_LOG_BASE="./client"
 INFER_TEST=infer_test.py
-SERVER_TIMEOUT=360
+SERVER_TIMEOUT=${SERVER_TIMEOUT:=600}
 
 if [ -z "$TEST_SYSTEM_SHARED_MEMORY" ]; then
     TEST_SYSTEM_SHARED_MEMORY="0"
@@ -61,22 +63,25 @@ if [ "$TEST_VALGRIND" -eq 1 ]; then
     LEAKCHECK_LOG_BASE="./valgrind_test"
     LEAKCHECK=/usr/bin/valgrind
     LEAKCHECK_ARGS_BASE="--leak-check=full --show-leak-kinds=definite --max-threads=3000 --num-callers=20"
-    SERVER_TIMEOUT=3600
+    SERVER_TIMEOUT=4000
     rm -f $LEAKCHECK_LOG_BASE*
+    # Remove 'python', 'python_dlpack' and 'onnx' from BACKENDS and test them
+    # separately below.
+    BACKENDS="graphdef savedmodel libtorch plan openvino"
 fi
 
 if [ "$TEST_SYSTEM_SHARED_MEMORY" -eq 1 ] || [ "$TEST_CUDA_SHARED_MEMORY" -eq 1 ]; then
-  EXPECTED_NUM_TESTS=${EXPECTED_NUM_TESTS:="29"}
+    EXPECTED_NUM_TESTS=${EXPECTED_NUM_TESTS:="33"}
 else
-  EXPECTED_NUM_TESTS=${EXPECTED_NUM_TESTS:="42"}
+    EXPECTED_NUM_TESTS=${EXPECTED_NUM_TESTS:="46"}
 fi
 
-TF_VERSION=${TF_VERSION:=1}
+TF_VERSION=${TF_VERSION:=2}
 TEST_JETSON=${TEST_JETSON:=0}
 
 # Default size (in MB) of shared memory to be used by each python model
-# instance (Default is 64MB)
-DEFAULT_SHM_SIZE_MB=${DEFAULT_SHM_SIZE_MB:=64}
+# instance (Default is 1MB)
+DEFAULT_SHM_SIZE_MB=${DEFAULT_SHM_SIZE_MB:=1}
 DEFAULT_SHM_SIZE_BYTES=$((1024*1024*$DEFAULT_SHM_SIZE_MB))
 
 # On windows the paths invoked by the script (running in WSL) must use
@@ -93,6 +98,14 @@ else
     TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"}
     SERVER=${TRITON_DIR}/bin/tritonserver
     BACKEND_DIR=${TRITON_DIR}/backends
+
+    # PyTorch on SBSA requires libgomp to be loaded first. See the following
+    # GitHub issue for more information:
+    # https://github.com/pytorch/pytorch/issues/2575
+    arch=`uname -m`
+    if [ $arch = "aarch64" ]; then
+      SERVER_LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libgomp.so.1
+    fi
 fi
 
 # Allow more time to exit. Ensemble brings in too many models
@@ -123,31 +136,21 @@ export BACKENDS
 ENSEMBLES=${ENSEMBLES:="1"}
 export ENSEMBLES
 
+# Test for both batch and nobatch models
+NOBATCH=${NOBATCH:="1"}
+export NOBATCH
+BATCH=${BATCH:="1"}
+export BATCH
+
 if [[ $BACKENDS == *"python_dlpack"* ]]; then
-    if [ "$TEST_JETSON" == "0" ]; then
-        if [[ "aarch64" != $(uname -m) ]] ; then
-            pip3 install torch==1.9.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
-        else
-            pip3 install torch==1.9.0 -f https://download.pytorch.org/whl/torch_stable.html
-        fi
+    if [[ "aarch64" != $(uname -m) ]] ; then
+        pip3 install torch==1.13.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+    else
+        pip3 install torch==1.13.0 -f https://download.pytorch.org/whl/torch_stable.html
     fi
 fi
 
-
-for TARGET in cpu gpu; do
-    if [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
-        if [ "$TARGET" == "gpu" ]; then
-            echo -e "Skip GPU testing on CPU-only device"
-            continue
-        fi
-        # set strict readiness=false on CPU-only device to allow
-        # unsuccessful load of TensorRT plans, which require GPU.
-        SERVER_ARGS="--model-repository=${MODELDIR} --strict-readiness=false --exit-on-error=false ${SERVER_ARGS_EXTRA}"
-    fi
-
-    SERVER_LOG=$SERVER_LOG_BASE.${TARGET}.log
-    CLIENT_LOG=$CLIENT_LOG_BASE.${TARGET}.log
-
+function generate_model_repository() {
     rm -fr models && mkdir models
     for BACKEND in $BACKENDS; do
       if [ "$BACKEND" == "python" ] || [ "$BACKEND" == "python_dlpack" ]; then
@@ -204,6 +207,9 @@ for TARGET in cpu gpu; do
             fi
           fi
         done
+      elif [ "$BACKEND" == "plan" ] && [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
+        # skip plan_tensorrt models since they don't run on CPU-only containers
+        continue
       else
         cp -r ${DATADIR}/qa_model_repository/${BACKEND}* \
           models/.
@@ -214,7 +220,10 @@ for TARGET in cpu gpu; do
 
       # Copy identity backend models and ensembles
       for BACKEND in $BACKENDS; do
-        if [ "$BACKEND" != "python" ] && [ "$BACKEND" != "python_dlpack" ] && [ "$BACKEND" != "openvino" ]; then
+        if [ "$BACKEND" == "plan" ] && [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
+            # skip plan_tensorrt models since they don't run on CPU-only containers
+            continue
+        elif [ "$BACKEND" != "python" ] && [ "$BACKEND" != "python_dlpack" ] && [ "$BACKEND" != "openvino" ]; then
             cp -r ${DATADIR}/qa_ensemble_model_repository/qa_model_repository/*${BACKEND}* \
               models/.
         fi
@@ -242,7 +251,12 @@ for TARGET in cpu gpu; do
 
     KIND="KIND_GPU" && [[ "$TARGET" == "cpu" ]] && KIND="KIND_CPU"
     for FW in $BACKENDS; do
-      if [ "$FW" != "plan" ] && [ "$FW" != "python" ] && [ "$FW" != "python_dlpack" ] && [ "$FW" != "openvino" ];then
+      if [ "$FW" == "onnx" ] && [ "$TEST_VALGRIND" -eq 1 ]; then
+        # Reduce the instance count to make loading onnx models faster
+        for MC in `ls models/${FW}*/config.pbtxt`; do
+            echo "instance_group [ { kind: ${KIND} count: 1 }]" >> $MC
+        done
+      elif [ "$FW" != "plan" ] && [ "$FW" != "python" ] && [ "$FW" != "python_dlpack" ] && [ "$FW" != "openvino" ];then
         for MC in `ls models/${FW}*/config.pbtxt`; do
             echo "instance_group [ { kind: ${KIND} }]" >> $MC
         done
@@ -269,6 +283,21 @@ for TARGET in cpu gpu; do
             sed -i "s/max_batch_size: 1/max_batch_size: 0/" config.pbtxt && \
             sed -i "s/dims: \[ 1 \]/dims: \[ -1, -1 \]/" config.pbtxt)
 
+}
+
+for TARGET in cpu gpu; do
+    if [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
+        if [ "$TARGET" == "gpu" ]; then
+            echo -e "Skip GPU testing on CPU-only device"
+            continue
+        fi
+    fi
+
+    SERVER_LOG=$SERVER_LOG_BASE.${TARGET}.log
+    CLIENT_LOG=$CLIENT_LOG_BASE.${TARGET}.log
+
+    generate_model_repository
+
     # Check if running a memory leak check
     if [ "$TEST_VALGRIND" -eq 1 ]; then
         LEAKCHECK_LOG=$LEAKCHECK_LOG_BASE.${TARGET}.log
@@ -299,7 +328,6 @@ for TARGET in cpu gpu; do
         fi
     fi
 
-
     set -e
 
     kill_server
@@ -314,6 +342,96 @@ for TARGET in cpu gpu; do
     set -e
 done
 
+# Run the 'python', 'python_dlpack' and 'onnx' models separately in the valgrind
+# test. Loading the python and python_dlpack models hits OOM issues under
+# valgrind, so only the batch or the nobatch models are loaded at a time.
+# Loading all the onnx models at once requires more than 12 hours, so they are
+# loaded separately to reduce the loading time.
+if [ "$TEST_VALGRIND" -eq 1 ]; then
+  TESTING_BACKENDS="python python_dlpack onnx"
+  EXPECTED_NUM_TESTS=42
+  if [[ "aarch64" != $(uname -m) ]] ; then
+      pip3 install torch==1.13.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+  else
+      pip3 install torch==1.13.0 -f https://download.pytorch.org/whl/torch_stable.html
+  fi
+
+  for BACKENDS in $TESTING_BACKENDS; do
+    export BACKENDS
+    for TARGET in cpu gpu; do
+      rm -fr *models
+      generate_model_repository
+      mkdir nobatch_models
+      mv ./models/*nobatch_* ./nobatch_models/.
+      cp -fr ./models/nop_* ./nobatch_models/.
+
+      for BATCHING_MODE in batch nobatch; do
+        if [ "$TRITON_SERVER_CPU_ONLY" == "1" ]; then
+          if [ "$TARGET" == "gpu" ]; then
+              echo -e "Skip GPU testing on CPU-only device"
+              continue
+          fi
+        fi
+
+        SERVER_LOG=$SERVER_LOG_BASE.${TARGET}.${BACKENDS}.${BATCHING_MODE}.log
+        CLIENT_LOG=$CLIENT_LOG_BASE.${TARGET}.${BACKENDS}.${BATCHING_MODE}.log
+
+        if [ "$BATCHING_MODE" == "batch" ]; then
+          NOBATCH="0"
+          export NOBATCH
+          BATCH="1"
+          export BATCH
+          MODELDIR=`pwd`/models
+        else
+          NOBATCH="1"
+          export NOBATCH
+          BATCH="0"
+          export BATCH
+          MODELDIR=`pwd`/nobatch_models
+        fi
+
+        SERVER_ARGS="--model-repository=${MODELDIR} ${SERVER_ARGS_EXTRA}"
+        LEAKCHECK_LOG=$LEAKCHECK_LOG_BASE.${TARGET}.${BACKENDS}.${BATCHING_MODE}.log
+        LEAKCHECK_ARGS="$LEAKCHECK_ARGS_BASE --log-file=$LEAKCHECK_LOG"
+        run_server_leakcheck
+
+        if [ "$SERVER_PID" == "0" ]; then
+            echo -e "\n***\n*** Failed to start $SERVER\n***"
+            cat $SERVER_LOG
+            exit 1
+        fi
+
+        set +e
+
+        python3 $INFER_TEST >$CLIENT_LOG 2>&1
+        if [ $? -ne 0 ]; then
+            cat $CLIENT_LOG
+            RET=1
+        else
+            check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+            if [ $? -ne 0 ]; then
+                cat $CLIENT_LOG
+                cat $TEST_RESULT_FILE
+                echo -e "\n***\n*** Test Result Verification Failed\n***"
+                RET=1
+            fi
+        fi
+
+        set -e
+
+        kill_server
+
+        set +e
+        python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG
+        if [ $? -ne 0 ]; then
+            RET=1
+        fi
+        set -e
+      done
+    done
+  done
+fi
+
 if [ $RET -eq 0 ]; then
   echo -e "\n***\n*** Test Passed\n***"
 else
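
The valgrind-only block above splits the generated repository into batch and nobatch subsets purely by the '_nobatch' naming convention ('mv ./models/*nobatch_* ./nobatch_models/.'). A minimal Python sketch of that partitioning step, assuming the same naming convention and the 'models/' and 'nobatch_models/' directories used above (illustration only, not part of the patch):

    import shutil
    from pathlib import Path

    models = Path("models")
    nobatch_models = Path("nobatch_models")
    nobatch_models.mkdir(exist_ok=True)

    # Move every *nobatch_* model directory into the nobatch repository, then
    # copy the nop_* helper models so both repositories remain loadable.
    for model_dir in models.glob("*nobatch_*"):
        shutil.move(str(model_dir), str(nobatch_models / model_dir.name))
    for nop_dir in models.glob("nop_*"):
        shutil.copytree(str(nop_dir), str(nobatch_models / nop_dir.name), dirs_exist_ok=True)
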
diff --git a/qa/L0_infer_reshape/infer_reshape_test.py b/qa/L0_infer_reshape/infer_reshape_test.py
old mode 100644
new mode 100755
index 0c8156c98f..e77dcbecaf
--- a/qa/L0_infer_reshape/infer_reshape_test.py
+++ b/qa/L0_infer_reshape/infer_reshape_test.py
@@ -1,4 +1,6 @@
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,169 +27,216 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from builtins import range
-from future.utils import iteritems
+import os
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
-import os
 
 np_dtype_string = np.dtype(object)
 
-TEST_SYSTEM_SHARED_MEMORY = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY',
-                                                  0)))
+TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
 
 
 class InferReshapeTest(tu.TestResultCollector):
-
-    def _full_reshape(self,
-                      dtype,
-                      input_shapes,
-                      output_shapes=None,
-                      no_batch=True):
+    def _full_reshape(self, dtype, input_shapes, output_shapes=None, no_batch=True):
         # 'shapes' is list of shapes, one for each input.
         if output_shapes is None:
             output_shapes = input_shapes
 
         # For validation assume any shape can be used...
-        if tu.validate_for_tf_model(dtype, dtype, dtype, input_shapes[0],
-                                    input_shapes[0], input_shapes[0]):
+        if tu.validate_for_tf_model(
+            dtype, dtype, dtype, input_shapes[0], input_shapes[0], input_shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                full_shapes = [[
-                    bs,
-                ] + input_shape for input_shape in input_shapes]
-                full_output_shapes = [[
-                    bs,
-                ] + output_shape for output_shape in output_shapes]
+                full_shapes = [
+                    [
+                        bs,
+                    ]
+                    + input_shape
+                    for input_shape in input_shapes
+                ]
+                full_output_shapes = [
+                    [
+                        bs,
+                    ]
+                    + output_shape
+                    for output_shape in output_shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'graphdef',
+                    "graphdef",
                     bs,
                     dtype,
                     full_shapes,
                     full_output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
                 iu.infer_zero(
                     self,
-                    'savedmodel',
+                    "savedmodel",
                     bs,
                     dtype,
                     full_shapes,
                     full_output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
             if no_batch:
                 iu.infer_zero(
                     self,
-                    'graphdef_nobatch',
+                    "graphdef_nobatch",
                     1,
                     dtype,
                     input_shapes,
                     output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
                 iu.infer_zero(
                     self,
-                    'savedmodel_nobatch',
+                    "savedmodel_nobatch",
                     1,
                     dtype,
                     input_shapes,
                     output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
 
-        if tu.validate_for_onnx_model(dtype, dtype, dtype, input_shapes[0],
-                                      input_shapes[0], input_shapes[0]):
+        if tu.validate_for_onnx_model(
+            dtype, dtype, dtype, input_shapes[0], input_shapes[0], input_shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                full_shapes = [[
-                    bs,
-                ] + input_shape for input_shape in input_shapes]
-                full_output_shapes = [[
-                    bs,
-                ] + output_shape for output_shape in output_shapes]
+                full_shapes = [
+                    [
+                        bs,
+                    ]
+                    + input_shape
+                    for input_shape in input_shapes
+                ]
+                full_output_shapes = [
+                    [
+                        bs,
+                    ]
+                    + output_shape
+                    for output_shape in output_shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'onnx',
+                    "onnx",
                     bs,
                     dtype,
                     full_shapes,
                     full_output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
             if no_batch:
                 iu.infer_zero(
                     self,
-                    'onnx_nobatch',
+                    "onnx_nobatch",
                     1,
                     dtype,
                     input_shapes,
                     output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
 
-        # Skip for libtorch string I/O
-        if tu.validate_for_libtorch_model(dtype, dtype, dtype, input_shapes[0],
-                                          input_shapes[0], input_shapes[0]) and \
-                                              (dtype != np_dtype_string):
+        if tu.validate_for_libtorch_model(
+            dtype,
+            dtype,
+            dtype,
+            input_shapes[0],
+            input_shapes[0],
+            input_shapes[0],
+            reshape=True,
+        ):
             # skip variable size reshape on libtorch for now,
             # see "gen_qa_reshape_model.py" for detail
             if dtype != np.int32:
                 # model that does not support batching
-                if no_batch:
+                # skip for libtorch string I/O
+                if no_batch and (dtype != np_dtype_string):
                     iu.infer_zero(
                         self,
-                        'libtorch_nobatch',
+                        "libtorch_nobatch",
                         1,
                         dtype,
                         input_shapes,
                         output_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
                 # model that supports batching
                 for bs in (1, 8):
-                    full_shapes = [[
-                        bs,
-                    ] + input_shape for input_shape in input_shapes]
-                    full_output_shapes = [[
-                        bs,
-                    ] + output_shape for output_shape in output_shapes]
+                    full_shapes = [
+                        [
+                            bs,
+                        ]
+                        + input_shape
+                        for input_shape in input_shapes
+                    ]
+                    full_output_shapes = [
+                        [
+                            bs,
+                        ]
+                        + output_shape
+                        for output_shape in output_shapes
+                    ]
                     iu.infer_zero(
                         self,
-                        'libtorch',
+                        "libtorch",
                         bs,
                         dtype,
                         full_shapes,
                         full_output_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
         for name in ["simple_reshape", "sequence_reshape", "fan_reshape"]:
             # [TODO] Skip variable size reshape on ensemble for now.
             # Need rework on how ensemble for reshape are generated
             if dtype == np.int32:
                 break
-            if tu.validate_for_ensemble_model(name, dtype, dtype, dtype,
-                                              input_shapes[0], input_shapes[0],
-                                              input_shapes[0]):
+            if tu.validate_for_ensemble_model(
+                name,
+                dtype,
+                dtype,
+                dtype,
+                input_shapes[0],
+                input_shapes[0],
+                input_shapes[0],
+            ):
                 # model that supports batching
                 for bs in (1, 8):
-                    full_shapes = [[
-                        bs,
-                    ] + input_shape for input_shape in input_shapes]
-                    full_output_shapes = [[
-                        bs,
-                    ] + output_shape for output_shape in output_shapes]
+                    full_shapes = [
+                        [
+                            bs,
+                        ]
+                        + input_shape
+                        for input_shape in input_shapes
+                    ]
+                    full_output_shapes = [
+                        [
+                            bs,
+                        ]
+                        + output_shape
+                        for output_shape in output_shapes
+                    ]
                     iu.infer_zero(
                         self,
                         name,
@@ -196,58 +245,67 @@ def _full_reshape(self,
                         full_shapes,
                         full_output_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
                 # model that does not support batching
                 if no_batch:
                     iu.infer_zero(
                         self,
-                        name + '_nobatch',
+                        name + "_nobatch",
                         1,
                         dtype,
                         input_shapes,
                         output_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
-    def _trt_reshape(self,
-                     dtype,
-                     input_shapes,
-                     output_shapes=None,
-                     no_batch=True):
+    def _trt_reshape(self, dtype, input_shapes, output_shapes=None, no_batch=True):
         # 'shapes' is list of shapes, one for each input.
         if output_shapes is None:
             output_shapes = input_shapes
 
-        if tu.validate_for_trt_model(dtype, dtype, dtype, input_shapes[0],
-                                     input_shapes[0], input_shapes[0]):
+        if tu.validate_for_trt_model(
+            dtype, dtype, dtype, input_shapes[0], input_shapes[0], input_shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                full_shapes = [[
-                    bs,
-                ] + input_shape for input_shape in input_shapes]
-                full_output_shapes = [[
-                    bs,
-                ] + output_shape for output_shape in output_shapes]
+                full_shapes = [
+                    [
+                        bs,
+                    ]
+                    + input_shape
+                    for input_shape in input_shapes
+                ]
+                full_output_shapes = [
+                    [
+                        bs,
+                    ]
+                    + output_shape
+                    for output_shape in output_shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'plan',
+                    "plan",
                     bs,
                     dtype,
                     full_shapes,
                     full_output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
             if no_batch:
                 iu.infer_zero(
                     self,
-                    'plan_nobatch',
+                    "plan_nobatch",
                     1,
                     dtype,
                     input_shapes,
                     output_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
 
     def test_ff1(self):
         self._full_reshape(np.float32, input_shapes=([1],), no_batch=False)
@@ -260,21 +318,24 @@ def test_ff3(self):
         self._full_reshape(np.float32, input_shapes=([4, 4], [2], [2, 2, 3]))
 
     def test_ff4(self):
-        self._full_reshape(np.float32,
-                           input_shapes=([4, 4], [2], [2, 2, 3], [1]),
-                           output_shapes=([16], [1, 2], [3, 2, 2], [1]))
-        self._trt_reshape(np.float32,
-                          input_shapes=([4, 4], [2], [2, 2, 3], [1]),
-                          output_shapes=([2, 2, 4], [1, 2, 1], [3, 2,
-                                                                2], [1, 1, 1]))
+        self._full_reshape(
+            np.float32,
+            input_shapes=([4, 4], [2], [2, 2, 3], [1]),
+            output_shapes=([16], [1, 2], [3, 2, 2], [1]),
+        )
+        self._trt_reshape(
+            np.float32,
+            input_shapes=([4, 4], [2], [2, 2, 3], [1]),
+            output_shapes=([2, 2, 4], [1, 2, 1], [3, 2, 2], [1, 1, 1]),
+        )
 
     def test_ii1(self):
         self._full_reshape(np.int32, input_shapes=([2, 4, 5, 6],))
 
     def test_ii2(self):
-        self._full_reshape(np.int32,
-                           input_shapes=([4, 1], [2]),
-                           output_shapes=([1, 4], [1, 2]))
+        self._full_reshape(
+            np.int32, input_shapes=([4, 1], [2]), output_shapes=([1, 4], [1, 2])
+        )
 
     def test_ii3(self):
         self._full_reshape(np.int32, input_shapes=([1, 4, 1], [8], [2, 2, 3]))
@@ -283,5 +344,5 @@ def test_oo1(self):
         self._full_reshape(np.object_, input_shapes=([1],), no_batch=False)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
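
The reformatted _full_reshape and _trt_reshape above repeatedly prepend the batch dimension to each per-input shape before calling iu.infer_zero. A standalone sketch of that pattern, with arbitrary example shapes (illustration only, not part of the patch):

    input_shapes = [[4, 4], [2], [2, 2, 3]]

    for bs in (1, 8):
        # Prepend the batch dimension to every per-input shape.
        full_shapes = [[bs] + shape for shape in input_shapes]
        print(bs, full_shapes)
        # bs=8 -> [[8, 4, 4], [8, 2], [8, 2, 2, 3]]
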
diff --git a/qa/L0_infer_reshape/test.sh b/qa/L0_infer_reshape/test.sh
index 325e24930d..218be954d9 100755
--- a/qa/L0_infer_reshape/test.sh
+++ b/qa/L0_infer_reshape/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2019-2021, NVIDIA CORPORATION. All rights reserved.
+# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
diff --git a/qa/L0_infer_variable/infer_variable_test.py b/qa/L0_infer_variable/infer_variable_test.py
old mode 100644
new mode 100755
index 95e31c6962..e5e6470a3c
--- a/qa/L0_infer_variable/infer_variable_test.py
+++ b/qa/L0_infer_variable/infer_variable_test.py
@@ -1,4 +1,6 @@
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,52 +27,54 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
 import os
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
 
 np_dtype_string = np.dtype(object)
 
-TEST_SYSTEM_SHARED_MEMORY = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY',
-                                                  0)))
+TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
 
 
 class InferVariableTest(tu.TestResultCollector):
-
-    def _full_exact(self,
-                    input_dtype,
-                    output0_dtype,
-                    output1_dtype,
-                    input_shape,
-                    output0_shape,
-                    output1_shape,
-                    output0_raw=True,
-                    output1_raw=True,
-                    swap=False):
-
-        def _infer_exact_helper(tester,
-                                pf,
-                                tensor_shape,
-                                batch_size,
-                                input_dtype,
-                                output0_dtype,
-                                output1_dtype,
-                                output0_raw=True,
-                                output1_raw=True,
-                                model_version=None,
-                                swap=False,
-                                outputs=("OUTPUT0", "OUTPUT1"),
-                                use_http=True,
-                                use_grpc=True,
-                                skip_request_id_check=False,
-                                use_streaming=True,
-                                correlation_id=0):
+    def _full_exact(
+        self,
+        input_dtype,
+        output0_dtype,
+        output1_dtype,
+        input_shape,
+        output0_shape,
+        output1_shape,
+        output0_raw=True,
+        output1_raw=True,
+        swap=False,
+    ):
+        def _infer_exact_helper(
+            tester,
+            pf,
+            tensor_shape,
+            batch_size,
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            output0_raw=True,
+            output1_raw=True,
+            model_version=None,
+            swap=False,
+            outputs=("OUTPUT0", "OUTPUT1"),
+            use_http=True,
+            use_grpc=True,
+            skip_request_id_check=False,
+            use_streaming=True,
+            correlation_id=0,
+        ):
             for bs in (1, batch_size):
                 # model that does not support batching
                 if bs == 1:
@@ -93,15 +97,23 @@ def _infer_exact_helper(tester,
                         use_streaming=use_streaming,
                         correlation_id=correlation_id,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
                 # model that supports batching. Skip for libtorch string I/O
-                elif pf == 'libtorch' and tu.validate_for_libtorch_model(
-                        input_dtype, output0_dtype, output1_dtype, tensor_shape,
-                        tensor_shape, tensor_shape, bs):
+                elif pf == "libtorch" and tu.validate_for_libtorch_model(
+                    input_dtype,
+                    output0_dtype,
+                    output1_dtype,
+                    tensor_shape,
+                    tensor_shape,
+                    tensor_shape,
+                    bs,
+                ):
                     iu.infer_exact(
                         tester,
-                        pf, (bs,) + tensor_shape,
+                        pf,
+                        (bs,) + tensor_shape,
                         bs,
                         input_dtype,
                         output0_dtype,
@@ -117,91 +129,128 @@ def _infer_exact_helper(tester,
                         use_streaming=use_streaming,
                         correlation_id=correlation_id,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
 
         all_ensemble_prefix = ["simple_", "sequence_", "fan_"]
         ensemble_prefix = [""]
         for prefix in all_ensemble_prefix:
-            if tu.validate_for_ensemble_model(prefix, input_dtype,
-                                              output0_dtype, output1_dtype,
-                                              input_shape, input_shape,
-                                              input_shape):
+            if tu.validate_for_ensemble_model(
+                prefix,
+                input_dtype,
+                output0_dtype,
+                output1_dtype,
+                input_shape,
+                input_shape,
+                input_shape,
+            ):
                 ensemble_prefix.append(prefix)
 
-        if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype,
-                                    input_shape, output0_shape, output1_shape):
+        if tu.validate_for_tf_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            input_shape,
+            output0_shape,
+            output1_shape,
+        ):
             for prefix in ensemble_prefix:
                 for pf in ["graphdef", "savedmodel"]:
-                    _infer_exact_helper(self,
-                                        prefix + pf,
-                                        input_shape,
-                                        8,
-                                        input_dtype,
-                                        output0_dtype,
-                                        output1_dtype,
-                                        output0_raw=output0_raw,
-                                        output1_raw=output1_raw,
-                                        swap=swap)
-
-        if tu.validate_for_trt_model(input_dtype, output0_dtype, output1_dtype,
-                                     input_shape, output0_shape, output1_shape):
+                    _infer_exact_helper(
+                        self,
+                        prefix + pf,
+                        input_shape,
+                        8,
+                        input_dtype,
+                        output0_dtype,
+                        output1_dtype,
+                        output0_raw=output0_raw,
+                        output1_raw=output1_raw,
+                        swap=swap,
+                    )
+
+        if tu.validate_for_trt_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            input_shape,
+            output0_shape,
+            output1_shape,
+        ):
             for prefix in ensemble_prefix:
                 if input_dtype == np.int8:
-                    _infer_exact_helper(self,
-                                        prefix + 'plan',
-                                        input_shape + (1, 1),
-                                        8,
-                                        input_dtype,
-                                        output0_dtype,
-                                        output1_dtype,
-                                        output0_raw=output0_raw,
-                                        output1_raw=output1_raw,
-                                        swap=swap)
+                    _infer_exact_helper(
+                        self,
+                        prefix + "plan",
+                        input_shape + (1, 1),
+                        8,
+                        input_dtype,
+                        output0_dtype,
+                        output1_dtype,
+                        output0_raw=output0_raw,
+                        output1_raw=output1_raw,
+                        swap=swap,
+                    )
                 else:
-                    _infer_exact_helper(self,
-                                        prefix + 'plan',
-                                        input_shape,
-                                        8,
-                                        input_dtype,
-                                        output0_dtype,
-                                        output1_dtype,
-                                        output0_raw=output0_raw,
-                                        output1_raw=output1_raw,
-                                        swap=swap)
-
-        if tu.validate_for_onnx_model(input_dtype, output0_dtype, output1_dtype,
-                                      input_shape, output0_shape,
-                                      output1_shape):
+                    _infer_exact_helper(
+                        self,
+                        prefix + "plan",
+                        input_shape,
+                        8,
+                        input_dtype,
+                        output0_dtype,
+                        output1_dtype,
+                        output0_raw=output0_raw,
+                        output1_raw=output1_raw,
+                        swap=swap,
+                    )
+
+        if tu.validate_for_onnx_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            input_shape,
+            output0_shape,
+            output1_shape,
+        ):
             # No basic ensemble models are created against custom models [TODO]
-            _infer_exact_helper(self,
-                                'onnx',
-                                input_shape,
-                                8,
-                                input_dtype,
-                                output0_dtype,
-                                output1_dtype,
-                                output0_raw=output0_raw,
-                                output1_raw=output1_raw,
-                                swap=swap)
-
-        if tu.validate_for_libtorch_model(input_dtype, output0_dtype,
-                                          output1_dtype, input_shape,
-                                          output0_shape, output1_shape):
+            _infer_exact_helper(
+                self,
+                "onnx",
+                input_shape,
+                8,
+                input_dtype,
+                output0_dtype,
+                output1_dtype,
+                output0_raw=output0_raw,
+                output1_raw=output1_raw,
+                swap=swap,
+            )
+
+        if tu.validate_for_libtorch_model(
+            input_dtype,
+            output0_dtype,
+            output1_dtype,
+            input_shape,
+            output0_shape,
+            output1_shape,
+        ):
             # No basic ensemble models are created against custom models [TODO]
-            _infer_exact_helper(self,
-                                'libtorch',
-                                input_shape,
-                                8,
-                                input_dtype,
-                                output0_dtype,
-                                output1_dtype,
-                                output0_raw=output0_raw,
-                                output1_raw=output1_raw,
-                                swap=swap)
+            _infer_exact_helper(
+                self,
+                "libtorch",
+                input_shape,
+                8,
+                input_dtype,
+                output0_dtype,
+                output1_dtype,
+                output0_raw=output0_raw,
+                output1_raw=output1_raw,
+                swap=swap,
+            )
 
     def test_raw_fff(self):
-        self._full_exact(np.float32, np.float32, np.float32, (16,), (16,),
-                         (16,))
+        self._full_exact(np.float32, np.float32, np.float32, (16,), (16,), (16,))
 
     def test_raw_fii(self):
         self._full_exact(np.float32, np.int32, np.int32, (2, 8), (2, 8), (2, 8))
@@ -210,8 +259,9 @@ def test_raw_fll(self):
         self._full_exact(np.float32, np.int64, np.int64, (8, 4), (8, 4), (8, 4))
 
     def test_raw_fil(self):
-        self._full_exact(np.float32, np.int32, np.int64, (2, 8, 2), (2, 8, 2),
-                         (2, 8, 2))
+        self._full_exact(
+            np.float32, np.int32, np.int64, (2, 8, 2), (2, 8, 2), (2, 8, 2)
+        )
 
     def test_raw_ffi(self):
         self._full_exact(np.float32, np.float32, np.int32, (16,), (16,), (16,))
@@ -220,95 +270,148 @@ def test_raw_iii(self):
         self._full_exact(np.int32, np.int32, np.int32, (2, 8), (2, 8), (2, 8))
 
     def test_faw_iif(self):
-        self._full_exact(np.int32, np.int32, np.float32, (2, 8, 2), (2, 8, 2),
-                         (2, 8, 2))
+        self._full_exact(
+            np.int32, np.int32, np.float32, (2, 8, 2), (2, 8, 2), (2, 8, 2)
+        )
 
     def test_raw_ooo(self):
-        self._full_exact(np_dtype_string, np_dtype_string, np_dtype_string,
-                         (16,), (16,), (16,))
+        self._full_exact(
+            np_dtype_string, np_dtype_string, np_dtype_string, (16,), (16,), (16,)
+        )
 
     def test_raw_oii(self):
-        self._full_exact(np_dtype_string, np.int32, np.int32, (2, 8), (2, 8),
-                         (2, 8))
+        self._full_exact(np_dtype_string, np.int32, np.int32, (2, 8), (2, 8), (2, 8))
 
     def test_raw_ooi(self):
-        self._full_exact(np_dtype_string, np_dtype_string, np.int32, (8, 4),
-                         (8, 4), (8, 4))
+        self._full_exact(
+            np_dtype_string, np_dtype_string, np.int32, (8, 4), (8, 4), (8, 4)
+        )
 
     def test_raw_oio(self):
-        self._full_exact(np_dtype_string, np.int32, np_dtype_string, (2, 8, 2),
-                         (2, 8, 2), (2, 8, 2))
+        self._full_exact(
+            np_dtype_string, np.int32, np_dtype_string, (2, 8, 2), (2, 8, 2), (2, 8, 2)
+        )
 
     def test_class_fff(self):
-        self._full_exact(np.float32,
-                         np.float32,
-                         np.float32, (16,), (16,), (16,),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.float32,
+            np.float32,
+            (16,),
+            (16,),
+            (16,),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_fii(self):
-        self._full_exact(np.float32,
-                         np.int32,
-                         np.int32, (2, 8), (2, 8), (2, 8),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.int32,
+            np.int32,
+            (2, 8),
+            (2, 8),
+            (2, 8),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_fll(self):
-        self._full_exact(np.float32,
-                         np.int64,
-                         np.int64, (8, 4), (8, 4), (8, 4),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.int64,
+            np.int64,
+            (8, 4),
+            (8, 4),
+            (8, 4),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_fil(self):
-        self._full_exact(np.float32,
-                         np.int32,
-                         np.int64, (2, 8, 2), (2, 8, 2), (2, 8, 2),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.int32,
+            np.int64,
+            (2, 8, 2),
+            (2, 8, 2),
+            (2, 8, 2),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_ffi(self):
-        self._full_exact(np.float32,
-                         np.float32,
-                         np.int32, (16,), (16,), (16,),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.float32,
+            np.int32,
+            (16,),
+            (16,),
+            (16,),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_iii(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.int32, (2, 8), (2, 8), (2, 8),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np.int32,
+            (2, 8),
+            (2, 8),
+            (2, 8),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_class_iif(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.float32, (2, 8, 2), (2, 8, 2), (2, 8, 2),
-                         output0_raw=False,
-                         output1_raw=False)
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np.float32,
+            (2, 8, 2),
+            (2, 8, 2),
+            (2, 8, 2),
+            output0_raw=False,
+            output1_raw=False,
+        )
 
     def test_mix_ffi(self):
-        self._full_exact(np.float32,
-                         np.float32,
-                         np.int32, (16,), (16,), (16,),
-                         output0_raw=True,
-                         output1_raw=False)
+        self._full_exact(
+            np.float32,
+            np.float32,
+            np.int32,
+            (16,),
+            (16,),
+            (16,),
+            output0_raw=True,
+            output1_raw=False,
+        )
 
     def test_mix_iii(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.int32, (2, 8), (2, 8), (2, 8),
-                         output0_raw=False,
-                         output1_raw=True)
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np.int32,
+            (2, 8),
+            (2, 8),
+            (2, 8),
+            output0_raw=False,
+            output1_raw=True,
+        )
 
     def test_mix_iif(self):
-        self._full_exact(np.int32,
-                         np.int32,
-                         np.float32, (2, 8, 2), (2, 8, 2), (2, 8, 2),
-                         output0_raw=True,
-                         output1_raw=False)
-
-
-if __name__ == '__main__':
+        self._full_exact(
+            np.int32,
+            np.int32,
+            np.float32,
+            (2, 8, 2),
+            (2, 8, 2),
+            (2, 8, 2),
+            output0_raw=True,
+            output1_raw=False,
+        )
+
+
+if __name__ == "__main__":
     unittest.main()
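
Throughout these tests the shared-memory switches are read from plain environment flags; the value is passed through int() before bool(), since bool("0") would otherwise evaluate to True. A self-contained sketch of the same parsing (illustration only, not part of the patch):

    import os

    # "0"/"1" strings must be converted to int first; bool("0") is True.
    TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
    TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))

    print(TEST_SYSTEM_SHARED_MEMORY, TEST_CUDA_SHARED_MEMORY)
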
diff --git a/qa/L0_infer_zero/infer_zero_test.py b/qa/L0_infer_zero/infer_zero_test.py
old mode 100644
new mode 100755
index e326529996..9e9b0f4625
--- a/qa/L0_infer_zero/infer_zero_test.py
+++ b/qa/L0_infer_zero/infer_zero_test.py
@@ -1,4 +1,6 @@
-# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,107 +27,128 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from builtins import range
-from future.utils import iteritems
+import os
 import unittest
-import numpy as np
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
-import os
 
 np_dtype_string = np.dtype(object)
 
-TEST_SYSTEM_SHARED_MEMORY = bool(
-    int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0)))
-TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY',
-                                                  0)))
+TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0)))
+TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0)))
 
 
 class InferZeroTest(tu.TestResultCollector):
-
     def _full_zero(self, dtype, shapes):
         # 'shapes' is list of shapes, one for each input.
 
         # For validation assume any shape can be used...
-        if tu.validate_for_tf_model(dtype, dtype, dtype, shapes[0], shapes[0],
-                                    shapes[0]):
+        if tu.validate_for_tf_model(
+            dtype, dtype, dtype, shapes[0], shapes[0], shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                batch_shapes = [[
-                    bs,
-                ] + shape for shape in shapes]
+                batch_shapes = [
+                    [
+                        bs,
+                    ]
+                    + shape
+                    for shape in shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'graphdef',
+                    "graphdef",
                     bs,
                     dtype,
                     batch_shapes,
                     batch_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
                 iu.infer_zero(
                     self,
-                    'savedmodel',
+                    "savedmodel",
                     bs,
                     dtype,
                     batch_shapes,
                     batch_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
-            iu.infer_zero(self,
-                          'graphdef_nobatch',
-                          1,
-                          dtype,
-                          shapes,
-                          shapes,
-                          use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                          use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-            iu.infer_zero(self,
-                          'savedmodel_nobatch',
-                          1,
-                          dtype,
-                          shapes,
-                          shapes,
-                          use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                          use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
-
-        if tu.validate_for_onnx_model(dtype, dtype, dtype, shapes[0], shapes[0],
-                                      shapes[0]):
+            iu.infer_zero(
+                self,
+                "graphdef_nobatch",
+                1,
+                dtype,
+                shapes,
+                shapes,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+            iu.infer_zero(
+                self,
+                "savedmodel_nobatch",
+                1,
+                dtype,
+                shapes,
+                shapes,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
+
+        if tu.validate_for_onnx_model(
+            dtype, dtype, dtype, shapes[0], shapes[0], shapes[0]
+        ):
             # model that supports batching
             for bs in (1, 8):
-                batch_shapes = [[
-                    bs,
-                ] + shape for shape in shapes]
+                batch_shapes = [
+                    [
+                        bs,
+                    ]
+                    + shape
+                    for shape in shapes
+                ]
                 iu.infer_zero(
                     self,
-                    'onnx',
+                    "onnx",
                     bs,
                     dtype,
                     batch_shapes,
                     batch_shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
             # model that does not support batching
-            iu.infer_zero(self,
-                          'onnx_nobatch',
-                          1,
-                          dtype,
-                          shapes,
-                          shapes,
-                          use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                          use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+            iu.infer_zero(
+                self,
+                "onnx_nobatch",
+                1,
+                dtype,
+                shapes,
+                shapes,
+                use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
+                use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+            )
 
         for name in ["simple_zero", "sequence_zero", "fan_zero"]:
-            if tu.validate_for_ensemble_model(name, dtype, dtype, dtype,
-                                              shapes[0], shapes[0], shapes[0]):
+            if tu.validate_for_ensemble_model(
+                name, dtype, dtype, dtype, shapes[0], shapes[0], shapes[0]
+            ):
                 # model that supports batching
                 for bs in (1, 8):
-                    batch_shapes = [[
-                        bs,
-                    ] + shape for shape in shapes]
+                    batch_shapes = [
+                        [
+                            bs,
+                        ]
+                        + shape
+                        for shape in shapes
+                    ]
                     iu.infer_zero(
                         self,
                         name,
@@ -134,81 +157,135 @@ def _full_zero(self, dtype, shapes):
                         batch_shapes,
                         batch_shapes,
                         use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                        use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                    )
                 # model that does not support batching
                 iu.infer_zero(
                     self,
-                    name + '_nobatch',
+                    name + "_nobatch",
                     1,
                     dtype,
                     shapes,
                     shapes,
                     use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY,
-                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY)
+                    use_cuda_shared_memory=TEST_CUDA_SHARED_MEMORY,
+                )
 
     def test_ff1_sanity(self):
-        self._full_zero(np.float32, ([
-            1,
-        ],))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    1,
+                ],
+            ),
+        )
 
     def test_ff1(self):
-        self._full_zero(np.float32, ([
-            0,
-        ],))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_ff3_sanity(self):
-        self._full_zero(np.float32, ([
-            1,
-        ], [
-            2,
-        ], [
-            1,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    1,
+                ],
+                [
+                    2,
+                ],
+                [
+                    1,
+                ],
+            ),
+        )
 
     def test_ff3_0(self):
-        self._full_zero(np.float32, ([
-            0,
-        ], [
-            0,
-        ], [
-            0,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    0,
+                ],
+                [
+                    0,
+                ],
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_ff3_1(self):
-        self._full_zero(np.float32, ([
-            0,
-        ], [
-            0,
-        ], [
-            1,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    0,
+                ],
+                [
+                    0,
+                ],
+                [
+                    1,
+                ],
+            ),
+        )
 
     def test_ff3_2(self):
-        self._full_zero(np.float32, ([
-            0,
-        ], [
-            1,
-        ], [
-            0,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    0,
+                ],
+                [
+                    1,
+                ],
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_ff3_3(self):
-        self._full_zero(np.float32, ([
-            1,
-        ], [
-            0,
-        ], [
-            0,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    1,
+                ],
+                [
+                    0,
+                ],
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_ff3_4(self):
-        self._full_zero(np.float32, ([
-            1,
-        ], [
-            0,
-        ], [
-            1,
-        ]))
+        self._full_zero(
+            np.float32,
+            (
+                [
+                    1,
+                ],
+                [
+                    0,
+                ],
+                [
+                    1,
+                ],
+            ),
+        )
 
     def test_hh1_sanity(self):
         self._full_zero(np.float16, ([2, 2],))
@@ -241,14 +318,24 @@ def test_hh3_4(self):
         self._full_zero(np.float16, ([1, 1], [0, 6], [2, 2]))
 
     def test_oo1_sanity(self):
-        self._full_zero(np_dtype_string, ([
-            2,
-        ],))
+        self._full_zero(
+            np_dtype_string,
+            (
+                [
+                    2,
+                ],
+            ),
+        )
 
     def test_oo1(self):
-        self._full_zero(np_dtype_string, ([
-            0,
-        ],))
+        self._full_zero(
+            np_dtype_string,
+            (
+                [
+                    0,
+                ],
+            ),
+        )
 
     def test_oo3_sanity(self):
         self._full_zero(np_dtype_string, ([2, 2], [2, 2], [1, 1]))
@@ -269,15 +356,25 @@ def test_oo3_4(self):
         self._full_zero(np_dtype_string, ([1, 1], [0, 6], [2, 2]))
 
     def test_bb1_sanity(self):
-        self._full_zero(bool, ([
-            10,
-        ],))
+        self._full_zero(
+            bool,
+            (
+                [
+                    10,
+                ],
+            ),
+        )
 
     def test_bb1_0(self):
-        self._full_zero(bool, ([
-            0,
-        ],))
+        self._full_zero(
+            bool,
+            (
+                [
+                    0,
+                ],
+            ),
+        )
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_infer_zero/test.sh b/qa/L0_infer_zero/test.sh
index 7f10f0dd18..02676b2f85 100755
--- a/qa/L0_infer_zero/test.sh
+++ b/qa/L0_infer_zero/test.sh
@@ -55,6 +55,10 @@ rm -fr models && mkdir models
 cp -r /data/inferenceserver/${REPO_VERSION}/qa_identity_model_repository/* models/. && \
     cp -r /data/inferenceserver/${REPO_VERSION}/qa_ensemble_model_repository/qa_identity_model_repository/* models/.
 
+# Remove version-compatible TensorRT models, as they require version-compatibility
+# mode to be turned on when starting the server.
+rm -rf models/plan_compatible*
+
 create_nop_version_dir `pwd`/models
 
 RET=0
diff --git a/qa/L0_inferentia_perf_analyzer/test.sh b/qa/L0_inferentia_perf_analyzer/test.sh
old mode 100644
new mode 100755
index 21e361ee6c..1881e07f87
--- a/qa/L0_inferentia_perf_analyzer/test.sh
+++ b/qa/L0_inferentia_perf_analyzer/test.sh
@@ -25,21 +25,21 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-# First need to set up enviroment
+# First need to set up environment
 if [ ${USE_TENSORFLOW} == "1" ] && [ ${USE_PYTORCH} == "1" ] ; then
     echo " Unsupported test configuration. Only one of USE_TENSORFLOW and USE_PYTORCH can be set to 1."
     exit 0
 elif [ ${USE_TENSORFLOW} == "1" ] ; then
-    echo "Setting up enviroment with tensorflow 1"
+    echo "Setting up environment with tensorflow 1"
     source ${TRITON_PATH}/python_backend/inferentia/scripts/setup.sh -t --tensorflow-version 1
 elif [ ${USE_PYTORCH} == "1" ] ; then
-    echo "Setting up enviroment with pytorch"
+    echo "Setting up environment with pytorch"
     source ${TRITON_PATH}/python_backend/inferentia/scripts/setup.sh -p
-else 
+else
     echo " Unsupported test configuration. USE_TENSORFLOW flag is: ${USE_TENSORFLOW} and USE_PYTORCH flag is: ${USE_PYTORCH}. Only one of them can be set to 1."
     exit 0
 fi
-echo "done setting up enviroment"
+echo "done setting up environment"
 
 REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
 if [ "$#" -ge 1 ]; then
@@ -80,32 +80,32 @@ function create_inferentia_models () {
     for DISABLE_DEFAULT_BATCHING_FLAG in ${DISABLE_DEFAULT_BATCHING_FLAGS}; do
         for BATCHED_FLAG in ${BATCHED_FLAGS}; do
             for TEST_TYPE in ${TEST_TYPES}; do
-                CURR_GEN_SCRIPT="${GEN_SCRIPT} --model_type ${MODEL_TYPE}  
-                --triton_model_dir ${TRITON_PATH}/models_${TEST_TYPE}${BATCHED_FLAG}${TEST_FRAMEWORK}${DISABLE_DEFAULT_BATCHING_FLAG}/add-sub-1x4 
+                CURR_GEN_SCRIPT="${GEN_SCRIPT} --model_type ${MODEL_TYPE}
+                --triton_model_dir ${TRITON_PATH}/models_${TEST_TYPE}${BATCHED_FLAG}${TEST_FRAMEWORK}${DISABLE_DEFAULT_BATCHING_FLAG}/add-sub-1x4
                 --compiled_model ${COMPILED_MODEL}"
                 if [ ${DISABLE_DEFAULT_BATCHING_FLAG} == "_no_batch" ]; then
-                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT} 
+                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
                     --disable_batch_requests_to_neuron"
                 fi
                 if [ ${BATCHED_FLAG} == "_batched_" ]; then
                     CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
-                    --triton_input INPUT__0,INT64,4 INPUT__1,INT64,4 
-                    --triton_output OUTPUT__0,INT64,4 OUTPUT__1,INT64,4          
-                    --enable_dynamic_batching 
-                    --max_batch_size 1000 
-                    --preferred_batch_size 8 
+                    --triton_input INPUT__0,INT64,4 INPUT__1,INT64,4
+                    --triton_output OUTPUT__0,INT64,4 OUTPUT__1,INT64,4
+                    --enable_dynamic_batching
+                    --max_batch_size 1000
+                    --preferred_batch_size 8
                     --max_queue_delay_microseconds 100"
                 else
                     CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
-                    --triton_input INPUT__0,INT64,-1x4 INPUT__1,INT64,-1x4 
+                    --triton_input INPUT__0,INT64,-1x4 INPUT__1,INT64,-1x4
                     --triton_output OUTPUT__0,INT64,-1x4 OUTPUT__1,INT64,-1x4"
                 fi
                 if [ ${TEST_TYPE} == "single" ]; then
-                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}   
+                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
                     --neuron_core_range 0:0"
                 elif [ ${TEST_TYPE} == "multiple" ]; then
-                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT} 
-                    --triton_model_instance_count 3 
+                    CURR_GEN_SCRIPT="${CURR_GEN_SCRIPT}
+                    --triton_model_instance_count 3
                     --neuron_core_range 0:7"
                 fi
                 echo ${CURR_GEN_SCRIPT}
diff --git a/qa/L0_io/test.sh b/qa/L0_io/test.sh
index ac1ad5559e..84ab4fb0c0 100755
--- a/qa/L0_io/test.sh
+++ b/qa/L0_io/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -47,16 +47,14 @@ MODELSDIR=`pwd`/models
 DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository
 ENSEMBLEDIR=/data/inferenceserver/${REPO_VERSION}/qa_ensemble_model_repository/qa_model_repository
 
-export CUDA_VISIBLE_DEVICES=0,1
-
 # Must explicitly set LD_LIBRARY_PATH so that IO_TEST_UTIL can find
 # libtritonserver.so.
 LD_LIBRARY_PATH=/opt/tritonserver/lib:$LD_LIBRARY_PATH
 
-rm -f $CLIENT_LOG.*
+rm -f $CLIENT_LOG*
 
 # PyTorch is required for the Python backend dlpack add sub models
-pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
+pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
 RET=0
 
 # Prepare float32 models with basic config
@@ -70,8 +68,7 @@ for trial in graphdef savedmodel onnx libtorch plan python python_dlpack; do
             cp ../python_models/add_sub/config.pbtxt $MODELSDIR/${full}/. && \
             (cd $MODELSDIR/${full} && \
                     sed -i "s/label_filename:.*//" config.pbtxt && \
-                    sed -i "0,/name:.*/{s/name:.*/name: \"${full}\"/}" config.pbtxt && \
-                                        echo "max_batch_size: 64" >> config.pbtxt)
+                    echo "max_batch_size: 64" >> config.pbtxt)
 
         # ensemble version of the model.
         mkdir -p $MODELSDIR/fan_${full}/1 && \
@@ -148,23 +145,47 @@ cp -r $MODELSDIR/fan_graphdef_float32_float32_float32 $MODELSDIR/fan_${full} &&
 cp -r $ENSEMBLEDIR/nop_TYPE_FP32_-1 $MODELSDIR/. && \
     mkdir -p $MODELSDIR/nop_TYPE_FP32_-1/1
 
+# prepare libtorch multi-device and multi-gpu models
+cp -r ../L0_libtorch_instance_group_kind_model/models/libtorch_multi_device $MODELSDIR/.
+cp ../L0_libtorch_instance_group_kind_model/gen_models.py ./gen_libtorch_model.py
+mkdir -p $MODELSDIR/libtorch_multi_device/1
+mkdir -p $MODELSDIR/libtorch_multi_gpu/1
+cp $MODELSDIR/libtorch_multi_device/config.pbtxt $MODELSDIR/libtorch_multi_gpu/.
+(cd $MODELSDIR/libtorch_multi_gpu && \
+    sed -i "s/name: \"libtorch_multi_device\"/name: \"libtorch_multi_gpu\"/" config.pbtxt)
+
+set +e
+python3 gen_libtorch_model.py >> $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Error when generating libtorch models. \n***"
+    cat $CLIENT_LOG
+    exit 1
+fi
+set -e
+
+TRIALS="graphdef savedmodel onnx libtorch plan python python_dlpack libtorch_multi_gpu libtorch_multi_device"
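+# The libtorch_multi_* models pin device placement in their own configs, so
+# the loop below skips the per-device config rewrite and the ensemble run
+# for them.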
 for input_device in -1 0 1; do
     for output_device in -1 0 1; do
-        for trial in graphdef savedmodel onnx libtorch plan python python_dlpack; do
+        for trial in ${TRIALS}; do
             # TensorRT Plan should only be deployed on GPU device
             model_devices="-1 0 1" && [[ "$trial" == "plan" ]] && model_devices="0 1"
+            full=${trial}_float32_float32_float32 && [[ "$trial" == "libtorch_multi"* ]] && full=${trial}
+
             for model_device in $model_devices; do
-                full=${trial}_float32_float32_float32
                 full_log=$CLIENT_LOG.$full.$input_device.$output_device.$model_device
 
                 host_policy=cpu
                 if [ "$model_device" == "-1" ]; then
-                    (cd $MODELSDIR/${full} && \
-                        sed -i "s/instance_group.*/instance_group [{ kind: KIND_CPU }]/" config.pbtxt)
+                    if [[ "$trial" != "libtorch_multi"* ]]; then
+                        (cd $MODELSDIR/${full} && \
+                            sed -i "s/instance_group.*/instance_group [{ kind: KIND_CPU }]/" config.pbtxt)
+                    fi
                 else
                     host_policy=gpu_${model_device}
-                    (cd $MODELSDIR/${full} && \
-                        sed -i "s/instance_group.*/instance_group [{ kind: KIND_GPU, gpus: [${model_device}] }]/" config.pbtxt)
+                    if [[ "$trial" != "libtorch_multi"* ]]; then
+                        (cd $MODELSDIR/${full} && \
+                            sed -i "s/instance_group.*/instance_group [{ kind: KIND_GPU, gpus: [${model_device}] }]/" config.pbtxt)
+                    fi
                 fi
 
                 set +e
@@ -196,14 +217,16 @@ for input_device in -1 0 1; do
                 set -e
 
                 # ensemble
-                set +e
-                $IO_TEST_UTIL -i $input_device -o $output_device -r $MODELSDIR -m fan_$full >>$full_log.ensemble 2>&1
-                if [ $? -ne 0 ]; then
-                    cat $full_log.ensemble
-                    echo -e "\n***\n*** Test Failed\n***"
-                    RET=1
+                if [[ "$trial" != "libtorch_multi"* ]]; then
+                    set +e
+                    $IO_TEST_UTIL -i $input_device -o $output_device -r $MODELSDIR -m fan_$full >>$full_log.ensemble 2>&1
+                    if [ $? -ne 0 ]; then
+                        cat $full_log.ensemble
+                        echo -e "\n***\n*** Test Failed\n***"
+                        RET=1
+                    fi
+                    set -e
                 fi
-                set -e
             done
         done
 
diff --git a/qa/L0_iterative_sequence/iterative_sequence_e2e.py b/qa/L0_iterative_sequence/iterative_sequence_e2e.py
new file mode 100755
index 0000000000..378b6ebe82
--- /dev/null
+++ b/qa/L0_iterative_sequence/iterative_sequence_e2e.py
@@ -0,0 +1,192 @@
+#!/usr/bin/env python
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import json
+
+# gRPC streaming helpers.
+import queue
+import unittest
+from functools import partial
+
+import numpy as np
+import requests
+import sseclient
+import test_util as tu
+import tritonclient.grpc as grpcclient
+from tritonclient.utils import InferenceServerException
+
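+# Doubled braces in MODEL_CONFIG_BASE are literal JSON braces under
+# str.format(); the single "{}" placeholder is filled with a scheduler
+# config (e.g. sequence_batching or dynamic_batching) by the tests below.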
+MODEL_CONFIG_BASE = """
+{{
+"backend": "iterative_sequence",
+"max_batch_size": 4,
+"input" : [
+  {{
+    "name": "INPUT",
+    "data_type": "TYPE_INT32",
+    "dims": [ 1 ]
+  }}
+],
+"output" : [
+  {{
+    "name": "OUTPUT",
+    "data_type": "TYPE_INT32",
+    "dims": [ 1 ]
+  }}
+],
+"model_transaction_policy" : {{
+  "decoupled": true
+}},
+{},
+"instance_group" : [{{ "kind": "KIND_CPU" }}]
+}}
+"""
+
+
+class UserData:
+    def __init__(self):
+        self._completed_requests = queue.Queue()
+
+
+def callback(user_data, result, error):
+    if error:
+        user_data._completed_requests.put(error)
+    else:
+        user_data._completed_requests.put(result)
+
+
+class IterativeSequenceTest(tu.TestResultCollector):
+    def setUp(self):
+        # Always make sure the original config is used
+        with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+            triton_client.load_model("iterative_sequence")
+
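+    # The /generate_stream endpoint streams one SSE event per decoupled
+    # response; with INPUT=2 the model returns two responses counting down
+    # to zero.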
+    def test_generate_stream(self):
+        headers = {"Accept": "text/event-stream"}
+        url = "http://localhost:8000/v2/models/iterative_sequence/generate_stream"
+        inputs = {"INPUT": 2}
+        res = requests.post(url, data=json.dumps(inputs), headers=headers)
+        res.raise_for_status()
+        client = sseclient.SSEClient(res)
+        res_count = 2
+        for event in client.events():
+            res_count -= 1
+            data = json.loads(event.data)
+            self.assertIn("OUTPUT", data)
+            self.assertEqual(res_count, data["OUTPUT"])
+        self.assertEqual(0, res_count)
+
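+    # Reused by the scheduler-override tests below; by default no sequence ID
+    # or START flag is sent, relying on iterative sequence mode to manage the
+    # sequence on the server side.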
+    def test_grpc_stream(self, sequence_id=0, sequence_start=False):
+        user_data = UserData()
+        with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+            triton_client.start_stream(callback=partial(callback, user_data))
+            inputs = []
+            inputs.append(grpcclient.InferInput("INPUT", [1, 1], "INT32"))
+            inputs[0].set_data_from_numpy(np.array([[2]], dtype=np.int32))
+
+            triton_client.async_stream_infer(
+                model_name="iterative_sequence",
+                inputs=inputs,
+                sequence_id=sequence_id,
+                sequence_start=sequence_start,
+            )
+            res_count = 2
+            while res_count > 0:
+                data_item = user_data._completed_requests.get()
+                res_count -= 1
+                if type(data_item) == InferenceServerException:
+                    raise data_item
+                else:
+                    self.assertEqual(res_count, data_item.as_numpy("OUTPUT")[0][0])
+            self.assertEqual(0, res_count)
+
+    def test_reschedule_error(self):
+        # Use short idle timeout (< backend reschedule delay: 0.5s) so that
+        # the backend won't be able to reschedule the request as the scheduler
+        # will terminate the sequence early
+        config = r'"sequence_batching" : { "iterative_sequence" : true, "max_sequence_idle_microseconds" : 200000 }'
+        with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+            triton_client.load_model(
+                "iterative_sequence", config=MODEL_CONFIG_BASE.format(config)
+            )
+        with self.assertRaises(InferenceServerException) as context:
+            # The short idle timeout terminates the sequence before the
+            # backend can reschedule, so the rescheduled request is rejected
+            # for missing the START flag
+            self.test_grpc_stream()
+        print(str(context.exception))
+        self.assertTrue(
+            "must specify the START flag on the first request of the sequence"
+            in str(context.exception)
+        )
+
+    def test_unsupported_sequence_scheduler(self):
+        # Override model config with scheduler settings that do not support
+        # request rescheduling.
+        configs = [
+            r'"sequence_batching" : { "direct" : {}, "iterative_sequence" : false }',
+            r'"sequence_batching" : { "oldest" : {}, "iterative_sequence" : false }',
+        ]
+        sid = 1
+        for sc in configs:
+            with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+                triton_client.load_model(
+                    "iterative_sequence", config=MODEL_CONFIG_BASE.format(sc)
+                )
+            with self.assertRaises(InferenceServerException) as context:
+                # Without specifying 'iterative_sequence : true', the sequence
+                # batcher expects sequence parameters to be provided explicitly
+                self.test_grpc_stream(sequence_id=sid, sequence_start=True)
+            sid += 1
+            self.assertTrue(
+                "Request is released with TRITONSERVER_REQUEST_RELEASE_RESCHEDULE"
+                in str(context.exception)
+            )
+
+    def test_unsupported_dynamic_scheduler(self):
+        # Override model config with scheduler settings that do not support
+        # request rescheduling.
+        configs = [
+            r'"dynamic_batching" : {}',
+        ]
+        for sc in configs:
+            with grpcclient.InferenceServerClient("localhost:8001") as triton_client:
+                triton_client.load_model(
+                    "iterative_sequence", config=MODEL_CONFIG_BASE.format(sc)
+                )
+            with self.assertRaises(InferenceServerException) as context:
+                self.test_grpc_stream()
+            self.assertTrue(
+                "Request is released with TRITONSERVER_REQUEST_RELEASE_RESCHEDULE"
+                in str(context.exception)
+            )
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_iterative_sequence/models/iterative_sequence/config.pbtxt b/qa/L0_iterative_sequence/models/iterative_sequence/config.pbtxt
new file mode 100644
index 0000000000..d6e539007b
--- /dev/null
+++ b/qa/L0_iterative_sequence/models/iterative_sequence/config.pbtxt
@@ -0,0 +1,48 @@
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+backend: "iterative_sequence"
+max_batch_size: 4
+input [
+  {
+    name: "INPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+output [
+  {
+    name: "OUTPUT"
+    data_type: TYPE_INT32
+    dims: [ 1 ]
+  }
+]
+model_transaction_policy {
+  decoupled: True
+}
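+# With iterative_sequence enabled the backend may reschedule a request to
+# emit multiple decoupled responses for a single sequence.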
+sequence_batching {
+  iterative_sequence : true
+}
+instance_group [{ kind: KIND_CPU }]
diff --git a/qa/L0_iterative_sequence/test.sh b/qa/L0_iterative_sequence/test.sh
new file mode 100755
index 0000000000..09117ffe93
--- /dev/null
+++ b/qa/L0_iterative_sequence/test.sh
@@ -0,0 +1,92 @@
+#!/bin/bash
+# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+source ../common/util.sh
+
+RET=0
+
+CLIENT_LOG="./iterative_sequence_client.log"
+TEST_PY=./iterative_sequence_e2e.py
+EXPECTED_NUM_TESTS="5"
+TEST_RESULT_FILE='test_results.txt'
+
+
+export CUDA_VISIBLE_DEVICES=0
+
+rm -fr *.log
+
+pip install sseclient-py
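+# sseclient-py is used by iterative_sequence_e2e.py to consume the SSE
+# responses from the generate_stream endpoint.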
+
+SERVER=/opt/tritonserver/bin/tritonserver
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=EXPLICIT"
+SERVER_LOG="./inference_server.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $TEST_PY >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    cat $CLIENT_LOG
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+exit $RET
diff --git a/qa/L0_java_memory_growth/MemoryGrowthTest.java b/qa/L0_java_memory_growth/MemoryGrowthTest.java
index d5a8092872..28243459ec 100644
--- a/qa/L0_java_memory_growth/MemoryGrowthTest.java
+++ b/qa/L0_java_memory_growth/MemoryGrowthTest.java
@@ -1,4 +1,4 @@
-// Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 //
 // Redistribution and use in source and binary forms, with or without
 // modification, are permitted provided that the following conditions
@@ -24,880 +24,920 @@
 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+import static org.bytedeco.tritonserver.global.tritonserver.*;
+
+import com.google.gson.*;
 import java.io.*;
 import java.util.*;
 import java.util.concurrent.*;
-import com.google.gson.*;
 import org.bytedeco.javacpp.*;
 import org.bytedeco.tritonserver.tritonserver.*;
-import static org.bytedeco.tritonserver.global.tritonserver.*;
 
 public class MemoryGrowthTest {
-    static final double TRITON_MIN_COMPUTE_CAPABILITY = 6.0;
-    private static boolean done = false;
-    static float max_growth_allowed = .10f;
-    static int max_mem_allowed = 30;
-
-    static void FAIL(String MSG) {
-        System.err.println("failure: " + MSG);
-        System.exit(1);
-    }
-
-    static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG) {
-        if (err__ != null) {
-            System.err.println("error: " + MSG + ":"
-                             + TRITONSERVER_ErrorCodeString(err__) + " - "
-                             + TRITONSERVER_ErrorMessage(err__));
-            TRITONSERVER_ErrorDelete(err__);
-            System.exit(1);
-        }
+  static final double TRITON_MIN_COMPUTE_CAPABILITY = 6.0;
+  private static boolean done = false;
+  static float max_growth_allowed = .10f;
+  static int max_mem_allowed = 30;
+
+  static void FAIL(String MSG)
+  {
+    System.err.println("failure: " + MSG);
+    System.exit(1);
+  }
+
+  static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG)
+  {
+    if (err__ != null) {
+      System.err.println(
+          "error: " + MSG + ":" + TRITONSERVER_ErrorCodeString(err__) + " - "
+          + TRITONSERVER_ErrorMessage(err__));
+      TRITONSERVER_ErrorDelete(err__);
+      System.exit(1);
     }
+  }
 
-    static boolean enforce_memory_type = false;
-    static int requested_memory_type;
-    // Parameters for percentile range to include (exclude outliers)
-    static final int max_percentile = 90;
-    static final int min_percentile = 10;
+  static boolean enforce_memory_type = false;
+  static int requested_memory_type;
+  // Parameters for percentile range to include (exclude outliers)
+  static final int max_percentile = 90;
+  static final int min_percentile = 10;
 
-    static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
-        public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p) { super(p); deallocator(new DeleteDeallocator(this)); }
-        protected static class DeleteDeallocator extends TRITONSERVER_Server implements Deallocator {
-            DeleteDeallocator(Pointer p) { super(p); }
-            @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
-        }
-    }
-
-    static void
-    Usage(String msg)
+  static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
+    public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p)
     {
-      if (msg != null) {
-        System.err.println(msg);
-      }
-
-      System.err.println("Usage: java " + MemoryGrowthTest.class.getSimpleName() + " [options]");
-      System.err.println("\t-i Set number of iterations");
-      System.err.println("\t-m <\"system\"|\"pinned\"|gpu>"
-                       + " Enforce the memory type for input and output tensors."
-                       + " If not specified, inputs will be in system memory and outputs"
-                       + " will be based on the model's preferred type.");
-      System.err.println("\t-v Enable verbose logging");
-      System.err.println("\t-r [model repository absolute path]");
-      System.err.println("\t--max-growth Specify maximum allowed memory growth (%)");
-      System.err.println("\t--max-memory Specify maximum allowed memory (MB)");
-
-      System.exit(1);
+      super(p);
+      deallocator(new DeleteDeallocator(this));
     }
-
-    static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, String tensor_name,
-            long byte_size, int preferred_memory_type,
-            long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
-            PointerPointer buffer_userp, IntPointer actual_memory_type,
-            LongPointer actual_memory_type_id)
-        {
-          // Initially attempt to make the actual memory type and id that we
-          // allocate be the same as preferred memory type
-          actual_memory_type.put(0, preferred_memory_type);
-          actual_memory_type_id.put(0, preferred_memory_type_id);
-
-          // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
-          // need to do any other book-keeping.
-          if (byte_size == 0) {
-            buffer.put(0, null);
-            buffer_userp.put(0, null);
-          } else {
-            Pointer allocated_ptr = new Pointer();
-            if (enforce_memory_type) {
-              actual_memory_type.put(0, requested_memory_type);
-            }
-
-            actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
-            allocated_ptr = Pointer.malloc(byte_size);
-
-            // Pass the tensor name with buffer_userp so we can show it when
-            // releasing the buffer.
-            if (!allocated_ptr.isNull()) {
-              buffer.put(0, allocated_ptr);
-              buffer_userp.put(0, Loader.newGlobalRef(tensor_name));
-            }
-          }
-
-          return null;  // Success
-        }
+    protected static class DeleteDeallocator
+        extends TRITONSERVER_Server implements Deallocator {
+      DeleteDeallocator(Pointer p) { super(p); }
+      @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
     }
+  }
 
-    static class ResponseRelease extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, Pointer buffer, Pointer buffer_userp,
-            long byte_size, int memory_type, long memory_type_id)
-        {
-          String name = null;
-          if (buffer_userp != null) {
-            name = (String)Loader.accessGlobalRef(buffer_userp);
-          } else {
-            name = "";
-          }
-          Pointer.free(buffer);
-          Loader.deleteGlobalRef(buffer_userp);
-
-          return null;  // Success
-        }
+  static void Usage(String msg)
+  {
+    if (msg != null) {
+      System.err.println(msg);
     }
 
-    static class InferRequestComplete extends TRITONSERVER_InferenceRequestReleaseFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
-        {
-          // We reuse the request so we don't delete it here.
-        }
-    }
-
-    static class InferResponseComplete extends TRITONSERVER_InferenceResponseCompleteFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
-        {
-          if (response != null) {
-            // Send 'response' to the future.
-            futures.get(userp).complete(response);
-          }
+    System.err.println(
+        "Usage: java " + MemoryGrowthTest.class.getSimpleName() + " [options]");
+    System.err.println("\t-i Set number of iterations");
+    System.err.println(
+        "\t-m <\"system\"|\"pinned\"|gpu>"
+        + " Enforce the memory type for input and output tensors."
+        + " If not specified, inputs will be in system memory and outputs"
+        + " will be based on the model's preferred type.");
+    System.err.println("\t-v Enable verbose logging");
+    System.err.println("\t-r [model repository absolute path]");
+    System.err.println(
+        "\t--max-growth Specify maximum allowed memory growth (%)");
+    System.err.println("\t--max-memory Specify maximum allowed memory (MB)");
+
+    System.exit(1);
+  }
+
+  static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, String tensor_name,
+        long byte_size, int preferred_memory_type,
+        long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
+        PointerPointer buffer_userp, IntPointer actual_memory_type,
+        LongPointer actual_memory_type_id)
+    {
+      // Initially attempt to make the actual memory type and id that we
+      // allocate be the same as preferred memory type
+      actual_memory_type.put(0, preferred_memory_type);
+      actual_memory_type_id.put(0, preferred_memory_type_id);
+
+      // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
+      // need to do any other book-keeping.
+      if (byte_size == 0) {
+        buffer.put(0, null);
+        buffer_userp.put(0, null);
+      } else {
+        Pointer allocated_ptr = new Pointer();
+        if (enforce_memory_type) {
+          actual_memory_type.put(0, requested_memory_type);
         }
-    }
 
-    static ConcurrentHashMap<Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures = new ConcurrentHashMap<>();
-    static ResponseAlloc responseAlloc = new ResponseAlloc();
-    static ResponseRelease responseRelease = new ResponseRelease();
-    static InferRequestComplete inferRequestComplete = new InferRequestComplete();
-    static InferResponseComplete inferResponseComplete = new InferResponseComplete();
+        actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
+        allocated_ptr = Pointer.malloc(byte_size);
 
-    static TRITONSERVER_Error
-    ParseModelMetadata(
-        JsonObject model_metadata, boolean[] is_int,
-        boolean[] is_torch_model)
-    {
-      String seen_data_type = null;
-      for (JsonElement input_element : model_metadata.get("inputs").getAsJsonArray()) {
-        JsonObject input = input_element.getAsJsonObject();
-        if (!input.get("datatype").getAsString().equals("INT32") &&
-            !input.get("datatype").getAsString().equals("FP32")) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_UNSUPPORTED,
-              "simple lib example only supports model with data type INT32 or " +
-              "FP32");
-        }
-        if (seen_data_type == null) {
-          seen_data_type = input.get("datatype").getAsString();
-        } else if (!seen_data_type.equals(input.get("datatype").getAsString())) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_INVALID_ARG,
-              "the inputs and outputs of 'simple' model must have the data type");
-        }
-      }
-      for (JsonElement output_element : model_metadata.get("outputs").getAsJsonArray()) {
-        JsonObject output = output_element.getAsJsonObject();
-        if (!output.get("datatype").getAsString().equals("INT32") &&
-            !output.get("datatype").getAsString().equals("FP32")) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_UNSUPPORTED,
-              "simple lib example only supports model with data type INT32 or " +
-              "FP32");
-        } else if (!seen_data_type.equals(output.get("datatype").getAsString())) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_INVALID_ARG,
-              "the inputs and outputs of 'simple' model must have the data type");
+        // Pass the tensor name with buffer_userp so we can show it when
+        // releasing the buffer.
+        if (!allocated_ptr.isNull()) {
+          buffer.put(0, allocated_ptr);
+          buffer_userp.put(0, Loader.newGlobalRef(tensor_name));
         }
       }
 
-      is_int[0] = seen_data_type.equals("INT32");
-      is_torch_model[0] =
-          model_metadata.get("platform").getAsString().equals("pytorch_libtorch");
-      return null;
+      return null; // Success
     }
-
-    static void
-    GenerateInputData(
-        IntPointer[] input0_data, IntPointer[] input1_data)
+  }
+
+  static class ResponseRelease
+      extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, Pointer buffer,
+        Pointer buffer_userp, long byte_size, int memory_type,
+        long memory_type_id)
     {
-      input0_data[0] = new IntPointer(16);
-      input1_data[0] = new IntPointer(16);
-      for (int i = 0; i < 16; ++i) {
-        input0_data[0].put(i, i);
-        input1_data[0].put(i, 1);
+      String name = null;
+      if (buffer_userp != null) {
+        name = (String) Loader.accessGlobalRef(buffer_userp);
+      } else {
+        name = "";
       }
+      Pointer.free(buffer);
+      Loader.deleteGlobalRef(buffer_userp);
+
+      return null; // Success
     }
+  }
 
-    static void
-    GenerateInputData(
-        FloatPointer[] input0_data, FloatPointer[] input1_data)
+  static class InferRequestComplete
+      extends TRITONSERVER_InferenceRequestReleaseFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
     {
-      input0_data[0] = new FloatPointer(16);
-      input1_data[0] = new FloatPointer(16);
-      for (int i = 0; i < 16; ++i) {
-        input0_data[0].put(i, i);
-        input1_data[0].put(i, 1);
-      }
+      // We reuse the request so we don't delete it here.
     }
+  }
 
-    static void
-    CompareResult(
-        String output0_name, String output1_name,
-        IntPointer input0, IntPointer input1, IntPointer output0,
-        IntPointer output1)
+  static class InferResponseComplete
+      extends TRITONSERVER_InferenceResponseCompleteFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
     {
-      for (int i = 0; i < 16; ++i) {
-        if ((input0.get(i) + input1.get(i)) != output0.get(i)) {
-          FAIL("incorrect sum in " + output0_name);
-        }
-        if ((input0.get(i) - input1.get(i)) != output1.get(i)) {
-          FAIL("incorrect difference in " + output1_name);
-        }
+      if (response != null) {
+        // Send 'response' to the future.
+        futures.get(userp).complete(response);
+      }
+    }
+  }
+
+  static ConcurrentHashMap<
+      Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures =
+      new ConcurrentHashMap<>();
+  static ResponseAlloc responseAlloc = new ResponseAlloc();
+  static ResponseRelease responseRelease = new ResponseRelease();
+  static InferRequestComplete inferRequestComplete = new InferRequestComplete();
+  static InferResponseComplete inferResponseComplete =
+      new InferResponseComplete();
+
+  static TRITONSERVER_Error ParseModelMetadata(
+      JsonObject model_metadata, boolean[] is_int, boolean[] is_torch_model)
+  {
+    String seen_data_type = null;
+    for (JsonElement input_element :
+         model_metadata.get("inputs").getAsJsonArray()) {
+      JsonObject input = input_element.getAsJsonObject();
+      if (!input.get("datatype").getAsString().equals("INT32")
+          && !input.get("datatype").getAsString().equals("FP32")) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_UNSUPPORTED,
+            "simple lib example only supports model with data type INT32 or "
+                + "FP32");
+      }
+      if (seen_data_type == null) {
+        seen_data_type = input.get("datatype").getAsString();
+      } else if (!seen_data_type.equals(input.get("datatype").getAsString())) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_INVALID_ARG,
+            "the inputs and outputs of 'simple' model must have the data type");
+      }
+    }
+    for (JsonElement output_element :
+         model_metadata.get("outputs").getAsJsonArray()) {
+      JsonObject output = output_element.getAsJsonObject();
+      if (!output.get("datatype").getAsString().equals("INT32")
+          && !output.get("datatype").getAsString().equals("FP32")) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_UNSUPPORTED,
+            "simple lib example only supports model with data type INT32 or "
+                + "FP32");
+      } else if (!seen_data_type.equals(output.get("datatype").getAsString())) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_INVALID_ARG,
+            "the inputs and outputs of 'simple' model must have the data type");
       }
     }
 
-    static void
-    CompareResult(
-        String output0_name, String output1_name,
-        FloatPointer input0, FloatPointer input1, FloatPointer output0,
-        FloatPointer output1)
-    {
-      for (int i = 0; i < 16; ++i) {
-        if ((input0.get(i) + input1.get(i)) != output0.get(i)) {
-          FAIL("incorrect sum in " + output0_name);
-        }
-        if ((input0.get(i) - input1.get(i)) != output1.get(i)) {
-          FAIL("incorrect difference in " + output1_name);
-        }
+    is_int[0] = seen_data_type.equals("INT32");
+    is_torch_model[0] =
+        model_metadata.get("platform").getAsString().equals("pytorch_libtorch");
+    return null;
+  }
+
+  static void GenerateInputData(
+      IntPointer[] input0_data, IntPointer[] input1_data)
+  {
+    input0_data[0] = new IntPointer(16);
+    input1_data[0] = new IntPointer(16);
+    for (int i = 0; i < 16; ++i) {
+      input0_data[0].put(i, i);
+      input1_data[0].put(i, 1);
+    }
+  }
+
+  static void GenerateInputData(
+      FloatPointer[] input0_data, FloatPointer[] input1_data)
+  {
+    input0_data[0] = new FloatPointer(16);
+    input1_data[0] = new FloatPointer(16);
+    for (int i = 0; i < 16; ++i) {
+      input0_data[0].put(i, i);
+      input1_data[0].put(i, 1);
+    }
+  }
+
+  static void CompareResult(
+      String output0_name, String output1_name, IntPointer input0,
+      IntPointer input1, IntPointer output0, IntPointer output1)
+  {
+    for (int i = 0; i < 16; ++i) {
+      if ((input0.get(i) + input1.get(i)) != output0.get(i)) {
+        FAIL("incorrect sum in " + output0_name);
+      }
+      if ((input0.get(i) - input1.get(i)) != output1.get(i)) {
+        FAIL("incorrect difference in " + output1_name);
+      }
+    }
+  }
+
+  static void CompareResult(
+      String output0_name, String output1_name, FloatPointer input0,
+      FloatPointer input1, FloatPointer output0, FloatPointer output1)
+  {
+    for (int i = 0; i < 16; ++i) {
+      if ((input0.get(i) + input1.get(i)) != output0.get(i)) {
+        FAIL("incorrect sum in " + output0_name);
+      }
+      if ((input0.get(i) - input1.get(i)) != output1.get(i)) {
+        FAIL("incorrect difference in " + output1_name);
       }
     }
+  }
+
+  static void Check(
+      TRITONSERVER_InferenceResponse response, Pointer input0_data,
+      Pointer input1_data, String output0, String output1,
+      long expected_byte_size, int expected_datatype, boolean is_int)
+  {
+    HashMap<String, BytePointer> output_data = new HashMap<>();
+
+    int[] output_count = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceResponseOutputCount(response, output_count),
+        "getting number of response outputs");
+    if (output_count[0] != 2) {
+      FAIL("expecting 2 response outputs, got " + output_count[0]);
+    }
 
-    static void
-    Check(
-        TRITONSERVER_InferenceResponse response,
-        Pointer input0_data, Pointer input1_data,
-        String output0, String output1,
-        long expected_byte_size,
-        int expected_datatype, boolean is_int)
-    {
-      HashMap<String, BytePointer> output_data = new HashMap<>();
+    for (int idx = 0; idx < output_count[0]; ++idx) {
+      BytePointer cname = new BytePointer((Pointer) null);
+      IntPointer datatype = new IntPointer(1);
+      LongPointer shape = new LongPointer((Pointer) null);
+      LongPointer dim_count = new LongPointer(1);
+      Pointer base = new Pointer();
+      SizeTPointer byte_size = new SizeTPointer(1);
+      IntPointer memory_type = new IntPointer(1);
+      LongPointer memory_type_id = new LongPointer(1);
+      Pointer userp = new Pointer();
 
-      int[] output_count = {0};
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceResponseOutputCount(response, output_count),
-          "getting number of response outputs");
-      if (output_count[0] != 2) {
-        FAIL("expecting 2 response outputs, got " + output_count[0]);
-      }
-
-      for (int idx = 0; idx < output_count[0]; ++idx) {
-        BytePointer cname = new BytePointer((Pointer)null);
-        IntPointer datatype = new IntPointer(1);
-        LongPointer shape = new LongPointer((Pointer)null);
-        LongPointer dim_count = new LongPointer(1);
-        Pointer base = new Pointer();
-        SizeTPointer byte_size = new SizeTPointer(1);
-        IntPointer memory_type = new IntPointer(1);
-        LongPointer memory_type_id = new LongPointer(1);
-        Pointer userp = new Pointer();
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseOutput(
-                response, idx, cname, datatype, shape, dim_count, base,
-                byte_size, memory_type, memory_type_id, userp),
-            "getting output info");
-
-        if (cname.isNull()) {
-          FAIL("unable to get output name");
-        }
-
-        String name = cname.getString();
-        if ((!name.equals(output0)) && (!name.equals(output1))) {
-          FAIL("unexpected output '" + name + "'");
-        }
-
-        if ((dim_count.get() != 2) || (shape.get(0) != 1) || (shape.get(1) != 16)) {
-          FAIL("unexpected shape for '" + name + "'");
-        }
+          TRITONSERVER_InferenceResponseOutput(
+              response, idx, cname, datatype, shape, dim_count, base, byte_size,
+              memory_type, memory_type_id, userp),
+          "getting output info");
 
-        if (datatype.get() != expected_datatype) {
-          FAIL(
-              "unexpected datatype '" +
-              TRITONSERVER_DataTypeString(datatype.get()) + "' for '" +
-              name + "'");
-        }
-
-        if (byte_size.get() != expected_byte_size) {
-          FAIL(
-              "unexpected byte-size, expected " +
-              expected_byte_size + ", got " +
-              byte_size.get() + " for " + name);
-        }
-
-        if (enforce_memory_type && (memory_type.get() != requested_memory_type)) {
-          FAIL(
-              "unexpected memory type, expected to be allocated in " +
-              TRITONSERVER_MemoryTypeString(requested_memory_type) +
-              ", got " + TRITONSERVER_MemoryTypeString(memory_type.get()) +
-              ", id " + memory_type_id.get() + " for " + name);
-        }
+      if (cname.isNull()) {
+        FAIL("unable to get output name");
+      }
 
-        // We make a copy of the data here... which we could avoid for
-        // performance reasons but ok for this simple example.
-        BytePointer odata = new BytePointer(byte_size.get());
-        output_data.put(name, odata);
-        odata.put(base.limit(byte_size.get()));
+      String name = cname.getString();
+      if ((!name.equals(output0)) && (!name.equals(output1))) {
+        FAIL("unexpected output '" + name + "'");
       }
 
-      if (is_int) {
-        CompareResult(
-            output0, output1, new IntPointer(input0_data), new IntPointer(input1_data),
-            new IntPointer(output_data.get(output0)), new IntPointer(output_data.get(output1)));
-      } else {
-        CompareResult(
-            output0, output1, new FloatPointer(input0_data), new FloatPointer(input1_data),
-            new FloatPointer(output_data.get(output0)), new FloatPointer(output_data.get(output1)));
-      }
-    }
-
-    /**
-    Returns whether the memory growth is within the acceptable range
-    @param  max_float_allowed     Maximum allowed memory growth (%)
-    @param  max_mem_allowed       Maximum allowed memory (MB)
-     */
-    static boolean
-    ValidateMemoryGrowth(float max_growth_allowed, int max_mem_allowed){
-      // Allocate list starting capacity to hold up to 24 hours worth of snapshots.
-      List<Double> memory_snapshots = new ArrayList<Double>(20000);
-      while(!done){
-        try {
-          Thread.sleep(5000);
-        } catch (InterruptedException e){
-          System.out.println("Memory growth validation interrupted.");
-        }
-        System.gc();
-        double snapshot = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
-        memory_snapshots.add(snapshot);
-        System.out.println("Memory allocated (MB):" + snapshot/1E6);
+      if ((dim_count.get() != 2) || (shape.get(0) != 1)
+          || (shape.get(1) != 16)) {
+        FAIL("unexpected shape for '" + name + "'");
       }
-      if(memory_snapshots.size() < 5){
-        System.out.println("Error: Not enough snapshots, found " + memory_snapshots.size()
-        + " snapshots");
-        return false;
+
+      if (datatype.get() != expected_datatype) {
+        FAIL(
+            "unexpected datatype '"
+            + TRITONSERVER_DataTypeString(datatype.get()) + "' for '" + name
+            + "'");
       }
 
-      // Measure memory growth without outliers by taking difference
-      // between 90th percentile and 10th percentile memory usage.
-      final double bytes_in_mb = 1E6;
-      Collections.sort(memory_snapshots);
-      int index_max = ((int) Math.ceil(max_percentile / 100.0 * memory_snapshots.size())) - 1;
-      int index_min = ((int) Math.ceil(min_percentile / 100.0 * memory_snapshots.size())) - 1;
-      double memory_allocation_delta = memory_snapshots.get(index_max) - memory_snapshots.get(index_min);
-      double memory_allocation_delta_mb = memory_allocation_delta / bytes_in_mb;
-      double memory_allocation_delta_percent = memory_allocation_delta / memory_snapshots.get(index_max);
+      if (byte_size.get() != expected_byte_size) {
+        FAIL(
+            "unexpected byte-size, expected " + expected_byte_size + ", got "
+            + byte_size.get() + " for " + name);
+      }
 
-      System.out.println("Change in memory allocation (MB): " +
-          memory_allocation_delta_mb + ", " +
-          (memory_allocation_delta_percent * 100) + "%");
+      if (enforce_memory_type && (memory_type.get() != requested_memory_type)) {
+        FAIL(
+            "unexpected memory type, expected to be allocated in "
+            + TRITONSERVER_MemoryTypeString(requested_memory_type) + ", got "
+            + TRITONSERVER_MemoryTypeString(memory_type.get()) + ", id "
+            + memory_type_id.get() + " for " + name);
+      }
 
-      boolean passed = true;
+      // We make a copy of the data here... which we could avoid for
+      // performance reasons but ok for this simple example.
+      BytePointer odata = new BytePointer(byte_size.get());
+      output_data.put(name, odata);
+      odata.put(base.limit(byte_size.get()));
+    }
 
-      if(memory_allocation_delta_percent >= max_growth_allowed){
-        passed = false;
-        System.out.println("Exceeded allowed memory growth (" +
-          (max_growth_allowed * 100) + "%)");
+    if (is_int) {
+      CompareResult(
+          output0, output1, new IntPointer(input0_data),
+          new IntPointer(input1_data), new IntPointer(output_data.get(output0)),
+          new IntPointer(output_data.get(output1)));
+    } else {
+      CompareResult(
+          output0, output1, new FloatPointer(input0_data),
+          new FloatPointer(input1_data),
+          new FloatPointer(output_data.get(output0)),
+          new FloatPointer(output_data.get(output1)));
+    }
+  }
+
+  /**
+  Returns whether the memory growth is within the acceptable range
+  @param  max_growth_allowed    Maximum allowed memory growth (%)
+  @param  max_mem_allowed       Maximum allowed memory (MB)
+   */
+  static boolean ValidateMemoryGrowth(
+      float max_growth_allowed, int max_mem_allowed)
+  {
+    // Allocate list starting capacity to hold up to 24 hours worth of
+    // snapshots.
+    List<Double> memory_snapshots = new ArrayList<Double>(20000);
+    while (!done) {
+      try {
+        Thread.sleep(5000);
       }
-
-      if((memory_snapshots.get(index_max) / bytes_in_mb) >= max_mem_allowed){
-        passed = false;
-        System.out.println("Exceeded allowed memory (" + max_mem_allowed + 
-          "MB), got " + (memory_snapshots.get(index_max) / bytes_in_mb) + "MB");
+      catch (InterruptedException e) {
+        System.out.println("Memory growth validation interrupted.");
       }
-      return passed;
+      System.gc();
+      double snapshot = Runtime.getRuntime().totalMemory()
+          - Runtime.getRuntime().freeMemory();
+      memory_snapshots.add(snapshot);
+      System.out.println("Memory allocated (MB):" + snapshot / 1E6);
+    }
+    if (memory_snapshots.size() < 5) {
+      System.out.println(
+          "Error: Not enough snapshots, found " + memory_snapshots.size()
+          + " snapshots");
+      return false;
     }
 
-    static void
-    RunInference(TRITONSERVER_ServerDeleter server, String model_name, boolean[] is_int, boolean[] is_torch_model, boolean check_accuracy)
-    throws Exception
-    {
-      // Create the allocator that will be used to allocate buffers for
-      // the result tensors.
-      TRITONSERVER_ResponseAllocator allocator = new TRITONSERVER_ResponseAllocator(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorNew(
-              allocator, responseAlloc, responseRelease, null /* start_fn */),
-          "creating response allocator");
-
-      // Inference
-      TRITONSERVER_InferenceRequest irequest = new TRITONSERVER_InferenceRequest(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestNew(
-              irequest, server, model_name, -1 /* model_version */),
-          "creating inference request");
+    // Measure memory growth without outliers by taking difference
+    // between 90th percentile and 10th percentile memory usage.
+    final double bytes_in_mb = 1E6;
+    Collections.sort(memory_snapshots);
+    int index_max =
+        ((int) Math.ceil(max_percentile / 100.0 * memory_snapshots.size())) - 1;
+    int index_min =
+        ((int) Math.ceil(min_percentile / 100.0 * memory_snapshots.size())) - 1;
+    double memory_allocation_delta =
+        memory_snapshots.get(index_max) - memory_snapshots.get(index_min);
+    double memory_allocation_delta_mb = memory_allocation_delta / bytes_in_mb;
+    double memory_allocation_delta_percent =
+        memory_allocation_delta / memory_snapshots.get(index_max);
+
+    System.out.println(
+        "Change in memory allocation (MB): " + memory_allocation_delta_mb + ", "
+        + (memory_allocation_delta_percent * 100) + "%");
+
+    boolean passed = true;
+
+    if (memory_allocation_delta_percent >= max_growth_allowed) {
+      passed = false;
+      System.out.println(
+          "Exceeded allowed memory growth (" + (max_growth_allowed * 100)
+          + "%)");
+    }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
-          "setting ID for the request");
+    if ((memory_snapshots.get(index_max) / bytes_in_mb) >= max_mem_allowed) {
+      passed = false;
+      System.out.println(
+          "Exceeded allowed memory (" + max_mem_allowed + "MB), got "
+          + (memory_snapshots.get(index_max) / bytes_in_mb) + "MB");
+    }
+    return passed;
+  }
+
+  static void RunInference(
+      TRITONSERVER_ServerDeleter server, String model_name, boolean[] is_int,
+      boolean[] is_torch_model, boolean check_accuracy) throws Exception
+  {
+    // Create the allocator that will be used to allocate buffers for
+    // the result tensors.
+    TRITONSERVER_ResponseAllocator allocator =
+        new TRITONSERVER_ResponseAllocator(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorNew(
+            allocator, responseAlloc, responseRelease, null /* start_fn */),
+        "creating response allocator");
+
+    // Inference
+    TRITONSERVER_InferenceRequest irequest =
+        new TRITONSERVER_InferenceRequest(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestNew(
+            irequest, server, model_name, -1 /* model_version */),
+        "creating inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
+        "setting ID for the request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetReleaseCallback(
+            irequest, inferRequestComplete, null /* request_release_userp */),
+        "setting request release callback");
+
+    // Inputs
+    String input0 = is_torch_model[0] ? "INPUT__0" : "INPUT0";
+    String input1 = is_torch_model[0] ? "INPUT__1" : "INPUT1";
+
+    long[] input0_shape = {1, 16};
+    long[] input1_shape = {1, 16};
+
+    int datatype =
+        (is_int[0]) ? TRITONSERVER_TYPE_INT32 : TRITONSERVER_TYPE_FP32;
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddInput(
+            irequest, input0, datatype, input0_shape, input0_shape.length),
+        "setting input 0 meta-data for the request");
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddInput(
+            irequest, input1, datatype, input1_shape, input1_shape.length),
+        "setting input 1 meta-data for the request");
+
+    String output0 = is_torch_model[0] ? "OUTPUT__0" : "OUTPUT0";
+    String output1 = is_torch_model[0] ? "OUTPUT__1" : "OUTPUT1";
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output0),
+        "requesting output 0 for the request");
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output1),
+        "requesting output 1 for the request");
+
+    // Create the data for the two input tensors. Initialize the first
+    // to unique values and the second to all ones.
+    BytePointer input0_data;
+    BytePointer input1_data;
+    if (is_int[0]) {
+      IntPointer[] p0 = {null}, p1 = {null};
+      GenerateInputData(p0, p1);
+      input0_data = p0[0].getPointer(BytePointer.class);
+      input1_data = p1[0].getPointer(BytePointer.class);
+    } else {
+      FloatPointer[] p0 = {null}, p1 = {null};
+      GenerateInputData(p0, p1);
+      input0_data = p0[0].getPointer(BytePointer.class);
+      input1_data = p1[0].getPointer(BytePointer.class);
+    }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetReleaseCallback(
-              irequest, inferRequestComplete, null /* request_release_userp */),
-          "setting request release callback");
+    long input0_size = input0_data.limit();
+    long input1_size = input1_data.limit();
 
-      // Inputs
-      String input0 = is_torch_model[0] ? "INPUT__0" : "INPUT0";
-      String input1 = is_torch_model[0] ? "INPUT__1" : "INPUT1";
+    Pointer input0_base = input0_data;
+    Pointer input1_base = input1_data;
 
-      long[] input0_shape = {1, 16};
-      long[] input1_shape = {1, 16};
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAppendInputData(
+            irequest, input0, input0_base, input0_size, requested_memory_type,
+            0 /* memory_type_id */),
+        "assigning INPUT0 data");
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAppendInputData(
+            irequest, input1, input1_base, input1_size, requested_memory_type,
+            0 /* memory_type_id */),
+        "assigning INPUT1 data");
 
-      int datatype =
-          (is_int[0]) ? TRITONSERVER_TYPE_INT32 : TRITONSERVER_TYPE_FP32;
+    // Perform inference...
+    {
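+      // The response callback completes this future; it is looked up in the
+      // 'futures' map keyed by the request pointer passed as userp.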
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddInput(
-              irequest, input0, datatype, input0_shape, input0_shape.length),
-          "setting input 0 meta-data for the request");
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
+
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddInput(
-              irequest, input1, datatype, input1_shape, input1_shape.length),
-          "setting input 1 meta-data for the request");
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
 
-      String output0 = is_torch_model[0] ? "OUTPUT__0" : "OUTPUT0";
-      String output1 = is_torch_model[0] ? "OUTPUT__1" : "OUTPUT1";
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output0),
-          "requesting output 0 for the request");
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
+      if (check_accuracy) {
+        Check(
+            completed_response, input0_data, input1_data, output0, output1,
+            input0_size, datatype, is_int[0]);
+      }
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output1),
-          "requesting output 1 for the request");
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
+    }
 
-      // Create the data for the two input tensors. Initialize the first
-      // to unique values and the second to all ones.
-      BytePointer input0_data;
-      BytePointer input1_data;
+    // Modify some input data in place and then reuse the request
+    // object. For simplicity we only do this when the input tensors are
+    // in non-pinned system memory.
+    if (!enforce_memory_type
+        || (requested_memory_type == TRITONSERVER_MEMORY_CPU)) {
       if (is_int[0]) {
-        IntPointer[] p0 = {null}, p1 = {null};
-        GenerateInputData(p0, p1);
-        input0_data = p0[0].getPointer(BytePointer.class);
-        input1_data = p1[0].getPointer(BytePointer.class);
+        new IntPointer(input0_data).put(0, 27);
       } else {
-        FloatPointer[] p0 = {null}, p1 = {null};
-        GenerateInputData(p0, p1);
-        input0_data = p0[0].getPointer(BytePointer.class);
-        input1_data = p1[0].getPointer(BytePointer.class);
+        new FloatPointer(input0_data).put(0, 27.0f);
       }
 
-      long input0_size = input0_data.limit();
-      long input1_size = input1_data.limit();
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
-      Pointer input0_base = input0_data;
-      Pointer input1_base = input1_data;
+      // Using a new promise so have to re-register the callback to set
+      // the promise as the userp.
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAppendInputData(
-              irequest, input0, input0_base, input0_size, requested_memory_type,
-              0 /* memory_type_id */),
-          "assigning INPUT0 data");
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
+
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
+      if (check_accuracy) {
+        Check(
+            completed_response, input0_data, input1_data, output0, output1,
+            input0_size, datatype, is_int[0]);
+      }
+
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
+    }
+
+    // Remove input data and then add back different data.
+    {
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceRequestRemoveAllInputData(irequest, input0),
+          "removing INPUT0 data");
       FAIL_IF_ERR(
           TRITONSERVER_InferenceRequestAppendInputData(
-              irequest, input1, input1_base, input1_size, requested_memory_type,
+              irequest, input0, input1_base, input1_size, requested_memory_type,
               0 /* memory_type_id */),
-          "assigning INPUT1 data");
-
-      // Perform inference...
-      {
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-        if (check_accuracy) {
-          Check(
-              completed_response, input0_data, input1_data, output0, output1,
-              input0_size, datatype, is_int[0]);
-        }
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
-
-      // Modify some input data in place and then reuse the request
-      // object. For simplicity we only do this when the input tensors are
-      // in non-pinned system memory.
-      if (!enforce_memory_type ||
-          (requested_memory_type == TRITONSERVER_MEMORY_CPU)) {
-        if (is_int[0]) {
-          new IntPointer(input0_data).put(0, 27);
-        } else {
-          new FloatPointer(input0_data).put(0, 27.0f);
-        }
+          "assigning INPUT1 data to INPUT0");
 
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        // Using a new promise so have to re-register the callback to set
-        // the promise as the userp.
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-        if (check_accuracy) {
-          Check(
-              completed_response, input0_data, input1_data, output0, output1,
-              input0_size, datatype, is_int[0]);
-        }
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
-
-      // Remove input data and then add back different data.
-      {
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestRemoveAllInputData(irequest, input0),
-            "removing INPUT0 data");
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestAppendInputData(
-                irequest, input0, input1_base, input1_size, requested_memory_type,
-                0 /* memory_type_id */),
-            "assigning INPUT1 data to INPUT0");
-
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        // Using a new promise so have to re-register the callback to set
-        // the promise as the userp.
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-
-        if (check_accuracy) {
-          // Both inputs are using input1_data...
-          Check(
-              completed_response, input1_data, input1_data, output0, output1,
-              input0_size, datatype, is_int[0]);
-        }
+      // Using a new promise so have to re-register the callback to set
+      // the promise as the userp.
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
 
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
 
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestDelete(irequest),
-          "deleting inference request");
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
+
+      if (check_accuracy) {
+        // Both inputs are using input1_data...
+        Check(
+            completed_response, input1_data, input1_data, output0, output1,
+            input0_size, datatype, is_int[0]);
+      }
 
       FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorDelete(allocator),
-          "deleting response allocator");
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
     }
 
-    public static void
-    main(String[] args) throws Exception
-    {
-      int num_iterations = 1000000;
-      String model_repository_path = null;
-      int verbose_level = 0;
-      boolean check_accuracy = false;
-
-      // Parse commandline...
-      for (int i = 0; i < args.length; i++) {
-        switch (args[i]) {
-          case "-i":
-            i++;
-            try {
-              num_iterations = Integer.parseInt(args[i]);
-            } catch (NumberFormatException e){
-              Usage(
-                  "-i must be used to specify number of iterations");
-            }
-            break;
-          case "-m":
-            enforce_memory_type = true;
-            i++;
-            if (args[i].equals("system")) {
-              requested_memory_type = TRITONSERVER_MEMORY_CPU;
-            } else if (args[i].equals("pinned")) {
-              requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
-            } else if (args[i].equals("gpu")) {
-              requested_memory_type = TRITONSERVER_MEMORY_GPU;
-            } else {
-              Usage(
-                  "-m must be used to specify one of the following types:" +
-                  " <\"system\"|\"pinned\"|gpu>");
-            }
-            break;
-          case "-r":
-            model_repository_path = args[++i];
-            break;
-          case "-v":
-            verbose_level = 1;
-            break;
-          case "-c":
-            check_accuracy = true;
-            break;
-          case "-?":
-            Usage(null);
-            break;
-          case "--max-growth":
-            i++;
-            try {
-              max_growth_allowed = Integer.parseInt(args[i]) / 100.0f;
-            } catch (NumberFormatException e){
-              Usage(
-                  "--max-growth must be an integer value specifying allowed memory growth (%)");
-            }
-            break;
-          case "--max-memory":
-            i++;
-            try {
-              max_mem_allowed = Integer.parseInt(args[i]);
-            } catch (NumberFormatException e){
-              Usage(
-                  "--max-memory must be an integer value specifying maximum allowed memory (MB)");
-            }
-            break;
-        }
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestDelete(irequest),
+        "deleting inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorDelete(allocator),
+        "deleting response allocator");
+  }
+
+  public static void main(String[] args) throws Exception
+  {
+    int num_iterations = 1000000;
+    String model_repository_path = null;
+    int verbose_level = 0;
+    boolean check_accuracy = false;
+
+    // Parse commandline...
+    for (int i = 0; i < args.length; i++) {
+      switch (args[i]) {
+        case "-i":
+          i++;
+          try {
+            num_iterations = Integer.parseInt(args[i]);
+          }
+          catch (NumberFormatException e) {
+            Usage("-i must be used to specify number of iterations");
+          }
+          break;
+        case "-m":
+          enforce_memory_type = true;
+          i++;
+          if (args[i].equals("system")) {
+            requested_memory_type = TRITONSERVER_MEMORY_CPU;
+          } else if (args[i].equals("pinned")) {
+            requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
+          } else if (args[i].equals("gpu")) {
+            requested_memory_type = TRITONSERVER_MEMORY_GPU;
+          } else {
+            Usage(
+                "-m must be used to specify one of the following types:"
+                + " <\"system\"|\"pinned\"|gpu>");
+          }
+          break;
+        case "-r":
+          model_repository_path = args[++i];
+          break;
+        case "-v":
+          verbose_level = 1;
+          break;
+        case "-c":
+          check_accuracy = true;
+          break;
+        case "-?":
+          Usage(null);
+          break;
+        case "--max-growth":
+          i++;
+          try {
+            max_growth_allowed = Integer.parseInt(args[i]) / 100.0f;
+          }
+          catch (NumberFormatException e) {
+            Usage(
+                "--max-growth must be an integer value specifying allowed memory growth (%)");
+          }
+          break;
+        case "--max-memory":
+          i++;
+          try {
+            max_mem_allowed = Integer.parseInt(args[i]);
+          }
+          catch (NumberFormatException e) {
+            Usage(
+                "--max-memory must be an integer value specifying maximum allowed memory (MB)");
+          }
+          break;
       }
+    }
 
-      if (model_repository_path == null) {
-        Usage("-r must be used to specify model repository path");
-      }
-      if (enforce_memory_type && requested_memory_type != TRITONSERVER_MEMORY_CPU) {
-        Usage("-m can only be set to \"system\" without enabling GPU");
-      }
+    if (model_repository_path == null) {
+      Usage("-r must be used to specify model repository path");
+    }
+    if (enforce_memory_type
+        && requested_memory_type != TRITONSERVER_MEMORY_CPU) {
+      Usage("-m can only be set to \"system\" without enabling GPU");
+    }
 
-      // Check API version.
-      int[] api_version_major = {0}, api_version_minor = {0};
-      FAIL_IF_ERR(
-          TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
-          "getting Triton API version");
-      if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0]) ||
-          (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
-        FAIL("triton server API version mismatch");
-      }
+    // Check API version.
+    int[] api_version_major = {0}, api_version_minor = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
+        "getting Triton API version");
+    if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0])
+        || (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
+      FAIL("triton server API version mismatch");
+    }
 
-      // Create the server...
-      TRITONSERVER_ServerOptions server_options = new TRITONSERVER_ServerOptions(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsNew(server_options),
-          "creating server options");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetModelRepositoryPath(
-              server_options, model_repository_path),
-          "setting model repository path");
+    // Create the server...
+    TRITONSERVER_ServerOptions server_options =
+        new TRITONSERVER_ServerOptions(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsNew(server_options),
+        "creating server options");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetModelRepositoryPath(
+            server_options, model_repository_path),
+        "setting model repository path");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
+        "setting verbose logging level");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetBackendDirectory(
+            server_options, "/opt/tritonserver/backends"),
+        "setting backend directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
+            server_options, "/opt/tritonserver/repoagents"),
+        "setting repository agent directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
+        "setting strict model configuration");
+    double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
+            server_options, min_compute_capability),
+        "setting minimum supported CUDA compute capability");
+
+    TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsDelete(server_options),
+        "deleting server options");
+
+    TRITONSERVER_ServerDeleter server =
+        new TRITONSERVER_ServerDeleter(server_ptr);
+
+    // Wait until the server is both live and ready.
+    int health_iters = 0;
+    while (true) {
+      boolean[] live = {false}, ready = {false};
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
-          "setting verbose logging level");
+          TRITONSERVER_ServerIsLive(server, live),
+          "unable to get server liveness");
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetBackendDirectory(
-              server_options, "/opt/tritonserver/backends"),
-          "setting backend directory");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
-              server_options, "/opt/tritonserver/repoagents"),
-          "setting repository agent directory");
+          TRITONSERVER_ServerIsReady(server, ready),
+          "unable to get server readiness");
+      System.out.println(
+          "Server Health: live " + live[0] + ", ready " + ready[0]);
+      if (live[0] && ready[0]) {
+        break;
+      }
+
+      if (++health_iters >= 10) {
+        FAIL("failed to find healthy inference server");
+      }
+
+      Thread.sleep(500);
+    }
+
+    // Print status of the server.
+    {
+      TRITONSERVER_Message server_metadata_message =
+          new TRITONSERVER_Message(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
-          "setting strict model configuration");
-      double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
+          TRITONSERVER_ServerMetadata(server, server_metadata_message),
+          "unable to get server metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
-              server_options, min_compute_capability),
-          "setting minimum supported CUDA compute capability");
+          TRITONSERVER_MessageSerializeToJson(
+              server_metadata_message, buffer, byte_size),
+          "unable to serialize server metadata message");
+
+      System.out.println("Server Status:");
+      System.out.println(buffer.limit(byte_size.get()).getString());
 
-      TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsDelete(server_options),
-          "deleting server options");
-
-      TRITONSERVER_ServerDeleter server = new TRITONSERVER_ServerDeleter(server_ptr);
-
-      // Wait until the server is both live and ready.
-      int health_iters = 0;
-      while (true) {
-        boolean[] live = {false}, ready = {false};
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsLive(server, live),
-            "unable to get server liveness");
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsReady(server, ready),
-            "unable to get server readiness");
-        System.out.println("Server Health: live " + live[0] + ", ready " + ready[0]);
-        if (live[0] && ready[0]) {
-          break;
-        }
+          TRITONSERVER_MessageDelete(server_metadata_message),
+          "deleting status metadata");
+    }
+
+    String model_name = "simple";
 
+    // Wait for the model to become available.
+    boolean[] is_torch_model = {false};
+    boolean[] is_int = {true};
+    boolean[] is_ready = {false};
+    health_iters = 0;
+    while (!is_ready[0]) {
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelIsReady(server, model_name, 1, is_ready),
+          "unable to get model readiness");
+      if (!is_ready[0]) {
         if (++health_iters >= 10) {
-          FAIL("failed to find healthy inference server");
+          FAIL("model failed to be ready in 10 iterations");
         }
-
         Thread.sleep(500);
+        continue;
       }
 
-      // Print status of the server.
-      {
-        TRITONSERVER_Message server_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerMetadata(server, server_metadata_message),
-            "unable to get server metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                server_metadata_message, buffer, byte_size),
-            "unable to serialize server metadata message");
-
-        System.out.println("Server Status:");
-        System.out.println(buffer.limit(byte_size.get()).getString());
-
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(server_metadata_message),
-            "deleting status metadata");
-      }
-
-      String model_name = "simple";
-
-      // Wait for the model to become available.
-      boolean[] is_torch_model = {false};
-      boolean[] is_int = {true};
-      boolean[] is_ready = {false};
-      health_iters = 0;
-      while (!is_ready[0]) {
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelIsReady(
-                server, model_name, 1, is_ready),
-            "unable to get model readiness");
-        if (!is_ready[0]) {
-          if (++health_iters >= 10) {
-            FAIL("model failed to be ready in 10 iterations");
-          }
-          Thread.sleep(500);
-          continue;
-        }
-
-        TRITONSERVER_Message model_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelMetadata(
-                server, model_name, 1, model_metadata_message),
-            "unable to get model metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                model_metadata_message, buffer, byte_size),
-            "unable to serialize model status protobuf");
-
-        JsonParser parser = new JsonParser();
-        JsonObject model_metadata = null;
-        try {
-          model_metadata = parser.parse(buffer.limit(byte_size.get()).getString()).getAsJsonObject();
-        } catch (Exception e) {
-          FAIL("error: failed to parse model metadata from JSON: " + e);
-        }
+      TRITONSERVER_Message model_metadata_message =
+          new TRITONSERVER_Message(null);
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelMetadata(
+              server, model_name, 1, model_metadata_message),
+          "unable to get model metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageSerializeToJson(
+              model_metadata_message, buffer, byte_size),
+          "unable to serialize model status protobuf");
+
+      JsonParser parser = new JsonParser();
+      JsonObject model_metadata = null;
+      try {
+        model_metadata = parser.parse(buffer.limit(byte_size.get()).getString())
+                             .getAsJsonObject();
+      }
+      catch (Exception e) {
+        FAIL("error: failed to parse model metadata from JSON: " + e);
+      }
 
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(model_metadata_message),
-            "deleting status protobuf");
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageDelete(model_metadata_message),
+          "deleting status protobuf");
 
-        if (!model_metadata.get("name").getAsString().equals(model_name)) {
-          FAIL("unable to find metadata for model");
-        }
+      if (!model_metadata.get("name").getAsString().equals(model_name)) {
+        FAIL("unable to find metadata for model");
+      }
 
-        boolean found_version = false;
-        if (model_metadata.has("versions")) {
-          for (JsonElement version : model_metadata.get("versions").getAsJsonArray()) {
-            if (version.getAsString().equals("1")) {
-              found_version = true;
-              break;
-            }
+      boolean found_version = false;
+      if (model_metadata.has("versions")) {
+        for (JsonElement version :
+             model_metadata.get("versions").getAsJsonArray()) {
+          if (version.getAsString().equals("1")) {
+            found_version = true;
+            break;
           }
         }
-        if (!found_version) {
-          FAIL("unable to find version 1 status for model");
-        }
-
-        FAIL_IF_ERR(
-            ParseModelMetadata(model_metadata, is_int, is_torch_model),
-            "parsing model metadata");
+      }
+      if (!found_version) {
+        FAIL("unable to find version 1 status for model");
       }
 
-      Runnable runnable =
-        () -> {
-          boolean passed = ValidateMemoryGrowth(max_growth_allowed, max_mem_allowed);
-          
-          // Sleep to give the garbage collector time to free the server.
-          // This avoids race conditions between Triton bindings' printing and
-          // Java's native printing below.
-          try {
-            Thread.sleep(5000);
-          } catch (InterruptedException e){
-            System.out.println("Sleep interrupted: " + e.toString());
-          }
-
-          if(passed){
-            System.out.println("Memory growth test passed");
-          } else {
-            System.out.println("Memory growth test FAILED");
-          }
-        };
-      Thread memory_thread = new Thread(runnable);
-      memory_thread.start();
+      FAIL_IF_ERR(
+          ParseModelMetadata(model_metadata, is_int, is_torch_model),
+          "parsing model metadata");
+    }
 
-      for(int i = 0; i < num_iterations; i++){
-        try (PointerScope scope = new PointerScope()) {
-          RunInference(server, model_name, is_int, is_torch_model, check_accuracy);
-        }
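+    // Validate memory growth on a background thread; it samples heap usage
+    // until the main thread sets 'done' after the inference loop below.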
+    Runnable runnable = () ->
+    {
+      boolean passed =
+          ValidateMemoryGrowth(max_growth_allowed, max_mem_allowed);
+
+      // Sleep to give the garbage collector time to free the server.
+      // This avoids race conditions between Triton bindings' printing and
+      // Java's native printing below.
+      try {
+        Thread.sleep(5000);
+      }
+      catch (InterruptedException e) {
+        System.out.println("Sleep interrupted: " + e.toString());
       }
-      done = true;
-      memory_thread.join();
 
-      System.exit(0);
+      if (passed) {
+        System.out.println("Memory growth test passed");
+      } else {
+        System.out.println("Memory growth test FAILED");
+      }
+    };
+    Thread memory_thread = new Thread(runnable);
+    memory_thread.start();
+
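+    // Each iteration runs in a PointerScope so JavaCPP-managed native
+    // allocations from that iteration are released when the scope closes.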
+    for (int i = 0; i < num_iterations; i++) {
+      try (PointerScope scope = new PointerScope()) {
+        RunInference(
+            server, model_name, is_int, is_torch_model, check_accuracy);
+      }
     }
+    done = true;
+    memory_thread.join();
+
+    System.exit(0);
+  }
 }
diff --git a/qa/L0_java_memory_growth/test.sh b/qa/L0_java_memory_growth/test.sh
index 60bacb9b94..d5ec33a5d5 100755
--- a/qa/L0_java_memory_growth/test.sh
+++ b/qa/L0_java_memory_growth/test.sh
@@ -27,14 +27,12 @@
 
 # Set up test files based on installation instructions
 # https://github.com/bytedeco/javacpp-presets/blob/master/tritonserver/README.md
-set +e
-rm -r javacpp-presets
-git clone https://github.com/bytedeco/javacpp-presets.git
-cd javacpp-presets
-mvn clean install --projects .,tritonserver
-mvn clean install -f platform --projects ../tritonserver/platform -Djavacpp.platform.host
-cd ..
+JAVACPP_BRANCH=${JAVACPP_BRANCH:="https://github.com/bytedeco/javacpp-presets.git"}
+JAVACPP_BRANCH_TAG=${JAVACPP_BRANCH_TAG:="master"}
 set -e
+git clone --single-branch --depth=1 -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git
+source client/src/java-api-bindings/scripts/install_dependencies_and_build.sh -b $PWD --javacpp-branch ${JAVACPP_BRANCH} --javacpp-tag ${JAVACPP_BRANCH_TAG} --keep-build-dependencies
+cd ..
 
 export MAVEN_OPTS="-XX:MaxGCPauseMillis=40"
 MODEL_REPO=`pwd`/models
@@ -76,12 +74,12 @@ fi
 LOG_IDX=$((LOG_IDX+1))
 CLIENT_LOG="./client_$LOG_IDX.log"
 
-# Longer-running memory growth test 
+# Longer-running memory growth test
 ITERS=1000000
 MAX_MEM_GROWTH_MB=10
 if [ "$TRITON_PERF_LONG" == 1 ]; then
     # ~1 day
-    ITERS=125000000
+    ITERS=150000000
     MAX_MEM_GROWTH_MB=25
 fi
 
diff --git a/qa/L0_java_resnet/ResnetTest.java b/qa/L0_java_resnet/ResnetTest.java
index 9bf46b22f7..4827273926 100644
--- a/qa/L0_java_resnet/ResnetTest.java
+++ b/qa/L0_java_resnet/ResnetTest.java
@@ -1,4 +1,4 @@
-// Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 //
 // Redistribution and use in source and binary forms, with or without
 // modification, are permitted provided that the following conditions
@@ -24,593 +24,616 @@
 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+import static org.bytedeco.tritonserver.global.tritonserver.*;
+
+import com.google.gson.*;
 import java.io.*;
 import java.util.*;
 import java.util.concurrent.*;
-import com.google.gson.*;
 import org.bytedeco.javacpp.*;
 import org.bytedeco.tritonserver.tritonserver.*;
-import static org.bytedeco.tritonserver.global.tritonserver.*;
 
 public class ResnetTest {
-    // Maximum allowed difference from expected model outputs
-    private static final float ALLOWED_DELTA = .001f;
-    private static final String[] MODELS = {
-      "resnet50_fp32_libtorch",
-      "resnet50_fp32_onnx",
+  // Maximum allowed difference from expected model outputs
+  private static final float ALLOWED_DELTA = .001f;
+  private static final String[] MODELS = {
+      "resnet50_fp32_libtorch", "resnet50_fp32_onnx",
       // TODO: fix build to support GPU only resnet50v1.5_fp16_savedmodel
       //"resnet50v1.5_fp16_savedmodel",
-      };
-    private static final double TRITON_MIN_COMPUTE_CAPABILITY = 6.0;
-    private enum Backend {
-      NONE,
-      ONNX,
-      TF,
-      TORCH,
+  };
+  private static final double TRITON_MIN_COMPUTE_CAPABILITY = 6.0;
+  private enum Backend {
+    NONE,
+    ONNX,
+    TF,
+    TORCH,
+  }
+
+  static void FAIL(String MSG)
+  {
+    System.err.println("failure: " + MSG);
+    System.exit(1);
+  }
+
+  static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG)
+  {
+    if (err__ != null) {
+      System.err.println(
+          "error: " + MSG + ":" + TRITONSERVER_ErrorCodeString(err__) + " - "
+          + TRITONSERVER_ErrorMessage(err__));
+      TRITONSERVER_ErrorDelete(err__);
+      System.exit(1);
     }
+  }
 
-    static void FAIL(String MSG) {
-        System.err.println("failure: " + MSG);
-        System.exit(1);
-    }
+  static boolean enforce_memory_type = false;
+  static int requested_memory_type;
 
-    static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG) {
-        if (err__ != null) {
-            System.err.println("error: " + MSG + ":"
-                             + TRITONSERVER_ErrorCodeString(err__) + " - "
-                             + TRITONSERVER_ErrorMessage(err__));
-            TRITONSERVER_ErrorDelete(err__);
-            System.exit(1);
-        }
+  static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
+    public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p)
+    {
+      super(p);
+      deallocator(new DeleteDeallocator(this));
     }
+    protected static class DeleteDeallocator
+        extends TRITONSERVER_Server implements Deallocator {
+      DeleteDeallocator(Pointer p) { super(p); }
+      @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
+    }
+  }
 
-    static boolean enforce_memory_type = false;
-    static int requested_memory_type;
-
-    static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
-        public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p) { super(p); deallocator(new DeleteDeallocator(this)); }
-        protected static class DeleteDeallocator extends TRITONSERVER_Server implements Deallocator {
-            DeleteDeallocator(Pointer p) { super(p); }
-            @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
-        }
+  static void Usage(String msg)
+  {
+    if (msg != null) {
+      System.err.println(msg);
     }
 
-    static void
-    Usage(String msg)
+    System.err.println(
+        "Usage: java " + ResnetTest.class.getSimpleName() + " [options]");
+    System.err.println(
+        "\t-m <\"system\"|\"pinned\"|gpu>"
+        + " Enforce the memory type for input and output tensors."
+        + " If not specified, inputs will be in system memory and outputs"
+        + " will be based on the model's preferred type.");
+    System.err.println("\t-v Enable verbose logging");
+    System.err.println("\t-r [model repository absolute path]");
+
+    System.exit(1);
+  }
+
+  static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, String tensor_name,
+        long byte_size, int preferred_memory_type,
+        long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
+        PointerPointer buffer_userp, IntPointer actual_memory_type,
+        LongPointer actual_memory_type_id)
     {
-      if (msg != null) {
-        System.err.println(msg);
-      }
+      // Initially attempt to make the actual memory type and id that we
+      // allocate be the same as preferred memory type
+      actual_memory_type.put(0, preferred_memory_type);
+      actual_memory_type_id.put(0, preferred_memory_type_id);
+
+      // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
+      // need to do any other book-keeping.
+      if (byte_size == 0) {
+        buffer.put(0, null);
+        buffer_userp.put(0, null);
+        System.out.println(
+            "allocated " + byte_size + " bytes for result tensor "
+            + tensor_name);
+      } else {
+        Pointer allocated_ptr = new Pointer();
+        if (enforce_memory_type) {
+          actual_memory_type.put(0, requested_memory_type);
+        }
 
-      System.err.println("Usage: java " + ResnetTest.class.getSimpleName() + " [options]");
-      System.err.println("\t-m <\"system\"|\"pinned\"|gpu>"
-                       + " Enforce the memory type for input and output tensors."
-                       + " If not specified, inputs will be in system memory and outputs"
-                       + " will be based on the model's preferred type.");
-      System.err.println("\t-v Enable verbose logging");
-      System.err.println("\t-r [model repository absolute path]");
+        actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
+        allocated_ptr = Pointer.malloc(byte_size);
+
+        // Pass the tensor name with buffer_userp so we can show it when
+        // releasing the buffer.
+        if (!allocated_ptr.isNull()) {
+          buffer.put(0, allocated_ptr);
+          buffer_userp.put(0, Loader.newGlobalRef(tensor_name));
+          System.out.println(
+              "allocated " + byte_size + " bytes in "
+              + TRITONSERVER_MemoryTypeString(actual_memory_type.get())
+              + " for result tensor " + tensor_name);
+        }
+      }
 
-      System.exit(1);
+      return null; // Success
     }
+  }
+
+  static class ResponseRelease
+      extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, Pointer buffer,
+        Pointer buffer_userp, long byte_size, int memory_type,
+        long memory_type_id)
+    {
+      String name = null;
+      if (buffer_userp != null) {
+        name = (String) Loader.accessGlobalRef(buffer_userp);
+      } else {
+        name = "";
+      }
 
-    static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, String tensor_name,
-            long byte_size, int preferred_memory_type,
-            long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
-            PointerPointer buffer_userp, IntPointer actual_memory_type,
-            LongPointer actual_memory_type_id)
-        {
-          // Initially attempt to make the actual memory type and id that we
-          // allocate be the same as preferred memory type
-          actual_memory_type.put(0, preferred_memory_type);
-          actual_memory_type_id.put(0, preferred_memory_type_id);
-
-          // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
-          // need to do any other book-keeping.
-          if (byte_size == 0) {
-            buffer.put(0, null);
-            buffer_userp.put(0, null);
-            System.out.println("allocated " + byte_size + " bytes for result tensor " + tensor_name);
-          } else {
-            Pointer allocated_ptr = new Pointer();
-            if (enforce_memory_type) {
-              actual_memory_type.put(0, requested_memory_type);
-            }
-
-            actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
-            allocated_ptr = Pointer.malloc(byte_size);
-
-            // Pass the tensor name with buffer_userp so we can show it when
-            // releasing the buffer.
-            if (!allocated_ptr.isNull()) {
-              buffer.put(0, allocated_ptr);
-              buffer_userp.put(0, Loader.newGlobalRef(tensor_name));
-              System.out.println("allocated " + byte_size + " bytes in "
-                               + TRITONSERVER_MemoryTypeString(actual_memory_type.get())
-                               + " for result tensor " + tensor_name);
-            }
-          }
+      Pointer.free(buffer);
+      Loader.deleteGlobalRef(buffer_userp);
 
-          return null;  // Success
-        }
+      return null; // Success
     }
+  }
 
-    static class ResponseRelease extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, Pointer buffer, Pointer buffer_userp,
-            long byte_size, int memory_type, long memory_type_id)
-        {
-          String name = null;
-          if (buffer_userp != null) {
-            name = (String)Loader.accessGlobalRef(buffer_userp);
-          } else {
-            name = "";
-          }
-          
-          Pointer.free(buffer);
-          Loader.deleteGlobalRef(buffer_userp);
-
-          return null;  // Success
-        }
+  static class InferRequestComplete
+      extends TRITONSERVER_InferenceRequestReleaseFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
+    {
+      // We reuse the request so we don't delete it here.
     }
+  }
 
-    static class InferRequestComplete extends TRITONSERVER_InferenceRequestReleaseFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
-        {
-          // We reuse the request so we don't delete it here.
-        }
+  static class InferResponseComplete
+      extends TRITONSERVER_InferenceResponseCompleteFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
+    {
+      if (response != null) {
+        // Send 'response' to the future.
+        futures.get(userp).complete(response);
+      }
     }
-
-    static class InferResponseComplete extends TRITONSERVER_InferenceResponseCompleteFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
-        {
-          if (response != null) {
-            // Send 'response' to the future.
-            futures.get(userp).complete(response);
-          }
-        }
+  }
+
+  static ConcurrentHashMap<
+      Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures =
+      new ConcurrentHashMap<>();
+  static ResponseAlloc responseAlloc = new ResponseAlloc();
+  static ResponseRelease responseRelease = new ResponseRelease();
+  static InferRequestComplete inferRequestComplete = new InferRequestComplete();
+  static InferResponseComplete inferResponseComplete =
+      new InferResponseComplete();
+
+  static void GenerateInputData(FloatPointer[] input_data)
+  {
+    // Input size is 3 * 224 * 224
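+    // Every element is set to 1 so results can be compared against the
+    // precomputed expected outputs.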
+    input_data[0] = new FloatPointer(150528);
+    for (int i = 0; i < 150528; ++i) {
+      input_data[0].put(i, 1);
     }
-
-    static ConcurrentHashMap<Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures = new ConcurrentHashMap<>();
-    static ResponseAlloc responseAlloc = new ResponseAlloc();
-    static ResponseRelease responseRelease = new ResponseRelease();
-    static InferRequestComplete inferRequestComplete = new InferRequestComplete();
-    static InferResponseComplete inferResponseComplete = new InferResponseComplete();
-
-    static void
-    GenerateInputData(
-        FloatPointer[] input_data)
-    {
-      // Input size is 3 * 224 * 224
-      input_data[0] = new FloatPointer(150528);
-      for (int i = 0; i < 150528; ++i) {
-        input_data[0].put(i, 1);
+  }
+
+  static boolean AreValidResults(
+      String model_name, FloatPointer output, FloatPointer expected_output)
+  {
+    int output_length = model_name.contains("tensorflow") ? 1001 : 1000;
+    for (int i = 0; i < output_length; ++i) {
+      float difference = output.get(i) - expected_output.get(i);
+      if (difference > ALLOWED_DELTA) {
+        System.out.println(
+            model_name + "inference failure: unexpected output "
+            + "in " + model_name + ", index " + i);
+
+        System.out.println(
+            "Value: " + output.get(i) + ", expected " + expected_output.get(i));
+
+        return false; // Failure
       }
     }
+    return true; // Success
+  }
+
+  static void Check(
+      String model_name, Backend backend,
+      TRITONSERVER_InferenceResponse response, Pointer input_data,
+      String output, int expected_datatype) throws Exception
+  {
+    HashMap<String, BytePointer> output_data = new HashMap<>();
+
+    int[] output_count = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceResponseOutputCount(response, output_count),
+        "getting number of response outputs");
+    if (output_count[0] != 1) {
+      FAIL("expecting 1 response output, got " + output_count[0]);
+    }
 
-    static boolean
-    AreValidResults(
-        String model_name, FloatPointer output, FloatPointer expected_output)
-    {
-      int output_length = model_name.contains("tensorflow") ? 1001 : 1000;
-      for (int i = 0; i < output_length; ++i) {
-        float difference = output.get(i) - expected_output.get(i);
-        if (difference > ALLOWED_DELTA) {
-          System.out.println(model_name + "inference failure: unexpected output " +
-          "in " + model_name + ", index " + i);
+    for (int idx = 0; idx < output_count[0]; ++idx) {
+      BytePointer cname = new BytePointer((Pointer) null);
+      IntPointer datatype = new IntPointer(1);
+      LongPointer shape = new LongPointer((Pointer) null);
+      LongPointer dim_count = new LongPointer(1);
+      Pointer base = new Pointer();
+      SizeTPointer byte_size = new SizeTPointer(1);
+      IntPointer memory_type = new IntPointer(1);
+      LongPointer memory_type_id = new LongPointer(1);
+      Pointer userp = new Pointer();
 
-          System.out.println("Value: " + output.get(i) + ", expected " +
-          expected_output.get(i));
+      FAIL_IF_ERR(
+          TRITONSERVER_InferenceResponseOutput(
+              response, idx, cname, datatype, shape, dim_count, base, byte_size,
+              memory_type, memory_type_id, userp),
+          "getting output info");
 
-          return false; // Failure
-        }
+      if (cname.isNull()) {
+        FAIL("unable to get output name");
       }
-      return true; // Success
-    }
 
-    static void
-    Check(
-        String model_name, Backend backend,
-        TRITONSERVER_InferenceResponse response,
-        Pointer input_data, String output,
-        int expected_datatype) throws Exception
-    {
-      HashMap<String, BytePointer> output_data = new HashMap<>();
-
-      int[] output_count = {0};
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceResponseOutputCount(response, output_count),
-          "getting number of response outputs");
-      if (output_count[0] != 1) {
-        FAIL("expecting 1 response output, got " + output_count[0]);
+      String name = cname.getString();
+      if (!name.equals(output)) {
+        FAIL("unexpected output '" + name + "'");
       }
 
-      for (int idx = 0; idx < output_count[0]; ++idx) {
-        BytePointer cname = new BytePointer((Pointer)null);
-        IntPointer datatype = new IntPointer(1);
-        LongPointer shape = new LongPointer((Pointer)null);
-        LongPointer dim_count = new LongPointer(1);
-        Pointer base = new Pointer();
-        SizeTPointer byte_size = new SizeTPointer(1);
-        IntPointer memory_type = new IntPointer(1);
-        LongPointer memory_type_id = new LongPointer(1);
-        Pointer userp = new Pointer();
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseOutput(
-                response, idx, cname, datatype, shape, dim_count, base,
-                byte_size, memory_type, memory_type_id, userp),
-            "getting output info");
-
-        if (cname.isNull()) {
-          FAIL("unable to get output name");
-        }
+      int output_length = backend == Backend.TF ? 1001 : 1000;
 
-        String name = cname.getString();
-        if (!name.equals(output)) {
-          FAIL("unexpected output '" + name + "'");
-        }
+      if ((dim_count.get() != 2) || (shape.get(0) != 1)
+          || shape.get(1) != output_length) {
+        FAIL("unexpected shape for '" + name + "'");
+      }
 
-        int output_length = backend == backend.TF ? 1001: 1000;
+      if (datatype.get() != expected_datatype) {
+        FAIL(
+            "unexpected datatype '"
+            + TRITONSERVER_DataTypeString(datatype.get()) + "' for '" + name
+            + "'");
+      }
 
-        if ((dim_count.get() != 2) || (shape.get(0) != 1)
-        || shape.get(1) != output_length) {
-          FAIL("unexpected shape for '" + name + "'");
-        }
+      if (enforce_memory_type && (memory_type.get() != requested_memory_type)) {
+        FAIL(
+            "unexpected memory type, expected to be allocated in "
+            + TRITONSERVER_MemoryTypeString(requested_memory_type) + ", got "
+            + TRITONSERVER_MemoryTypeString(memory_type.get()) + ", id "
+            + memory_type_id.get() + " for " + name);
+      }
 
-        if (datatype.get() != expected_datatype) {
-          FAIL(
-              "unexpected datatype '" +
-              TRITONSERVER_DataTypeString(datatype.get()) + "' for '" +
-              name + "'");
-        }
+      // We make a copy of the data here... which we could avoid for
+      // performance reasons but ok for this simple example.
+      BytePointer odata = new BytePointer(byte_size.get());
+      output_data.put(name, odata);
+      odata.put(base.limit(byte_size.get()));
+    }
 
-        if (enforce_memory_type && (memory_type.get() != requested_memory_type)) {
-          FAIL(
-              "unexpected memory type, expected to be allocated in " +
-              TRITONSERVER_MemoryTypeString(requested_memory_type) +
-              ", got " + TRITONSERVER_MemoryTypeString(memory_type.get()) +
-              ", id " + memory_type_id.get() + " for " + name);
-        }
+    // Expected output for model
+    String file_name = "expected_output_data/expected_output_";
+    switch (backend) {
+      case ONNX:
+        file_name += "onnx";
+        break;
+      case TF:
+        file_name += "tensorflow";
+        break;
+      case TORCH:
+        file_name += "pytorch";
+        break;
+      default:
+        FAIL("Unsupported model type");
+        break;
+    }
+    file_name += ".txt";
 
-        // We make a copy of the data here... which we could avoid for
-        // performance reasons but ok for this simple example.
-        BytePointer odata = new BytePointer(byte_size.get());
-        output_data.put(name, odata);
-        odata.put(base.limit(byte_size.get()));
-      }
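+    // The TensorFlow model reports 1001 classes while the ONNX and Torch
+    // models report 1000, so size the expected output accordingly.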
+    int output_length = backend == Backend.TF ? 1001 : 1000;
+    FloatPointer expected_output = new FloatPointer(output_length);
 
-      // Expected output for model
-      String file_name = "expected_output_data/expected_output_";
-      switch (backend) {
-        case ONNX:
-          file_name += "onnx";
-          break;
-        case TF:
-          file_name += "tensorflow";
-          break;
-        case TORCH:
-          file_name += "pytorch";
-          break;
-        default:
-          FAIL("Unsupported model type");
-          break;
-      }
-      file_name += ".txt";
-      
-      int output_length = backend == backend.TF ? 1001: 1000;
-      FloatPointer expected_output = new FloatPointer(output_length);
-
-      try (Scanner scanner = new Scanner(new File(file_name))) {
-        for (int i = 0; i < output_length; ++i) {
-          expected_output.put(i, scanner.nextFloat());
-        } 
+    try (Scanner scanner = new Scanner(new File(file_name))) {
+      for (int i = 0; i < output_length; ++i) {
+        expected_output.put(i, scanner.nextFloat());
       }
+    }
 
-      boolean correct_results = AreValidResults(
-          model_name, new FloatPointer(output_data.get(output)),
-          expected_output);
+    boolean correct_results = AreValidResults(
+        model_name, new FloatPointer(output_data.get(output)), expected_output);
 
-      if(correct_results){
-        System.out.println(backend.name() + " test PASSED");
-      } else {
-        System.out.println(backend.name() + " test FAILED");
-      }
+    if (correct_results) {
+      System.out.println(backend.name() + " test PASSED");
+    } else {
+      System.out.println(backend.name() + " test FAILED");
     }
+  }
 
-    static void
-    PerformInference(
+  static void PerformInference(
       TRITONSERVER_ServerDeleter server, String model_name) throws Exception
-    {
-      // Get type of model
-      Backend backend = Backend.NONE;
-      if(model_name.contains("onnx")) {
-        backend = Backend.ONNX;
-      } else if (model_name.contains("savedmodel")) {
-        backend = Backend.TF;
-      } else if (model_name.contains("torch")) {
-        backend = Backend.TORCH;
-      } else {
-        FAIL("Supported model types (Onnx, TensorFlow, Torch) " +
-        "cannot be inferred from model name " + model_name);
-      }
+  {
+    // Get type of model
+    Backend backend = Backend.NONE;
+    if (model_name.contains("onnx")) {
+      backend = Backend.ONNX;
+    } else if (model_name.contains("savedmodel")) {
+      backend = Backend.TF;
+    } else if (model_name.contains("torch")) {
+      backend = Backend.TORCH;
+    } else {
+      FAIL(
+          "Supported model types (Onnx, TensorFlow, Torch) "
+          + "cannot be inferred from model name " + model_name);
+    }
 
-      // Wait for the model to become available.
-      boolean[] is_ready = {false};
-      int health_iters = 0;
-      while (!is_ready[0]) {
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelIsReady(
-                server, model_name, 1, is_ready),
-            "unable to get model readiness");
-        if (!is_ready[0]) {
-          if (++health_iters >= 10) {
-            FAIL(model_name + " model failed to be ready in 10 iterations");
-          }
-          Thread.sleep(500);
-          continue;
+    // Wait for the model to become available.
+    boolean[] is_ready = {false};
+    int health_iters = 0;
+    while (!is_ready[0]) {
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelIsReady(server, model_name, 1, is_ready),
+          "unable to get model readiness");
+      if (!is_ready[0]) {
+        if (++health_iters >= 10) {
+          FAIL(model_name + " model failed to be ready in 10 iterations");
         }
+        Thread.sleep(500);
+        continue;
       }
+    }
+
+    // Create the allocator that will be used to allocate buffers for
+    // the result tensors.
+    TRITONSERVER_ResponseAllocator allocator =
+        new TRITONSERVER_ResponseAllocator(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorNew(
+            allocator, responseAlloc, responseRelease, null /* start_fn */),
+        "creating response allocator");
+
+    // Inference
+    TRITONSERVER_InferenceRequest irequest =
+        new TRITONSERVER_InferenceRequest(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestNew(
+            irequest, server, model_name, -1 /* model_version */),
+        "creating inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
+        "setting ID for the request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetReleaseCallback(
+            irequest, inferRequestComplete, null /* request_release_userp */),
+        "setting request release callback");
+
+
+    // Model inputs
+    String input = "";
+    String output = "";
+    long[] input_shape = {1, 224, 224, 3};
+
+    switch (backend) {
+      case ONNX:
+        input = "import/input:0";
+        output = "import/resnet_v1_50/predictions/Softmax:0";
+        break;
+      case TF:
+        input = "input";
+        output = "probabilities";
+        break;
+      case TORCH:
+        input = "INPUT__0";
+        input_shape[1] = 3;
+        input_shape[3] = 224;
+        output = "OUTPUT__0";
+        break;
+      default:
+        FAIL("Unsupported model type");
+        break;
+    }
+
+    int datatype = TRITONSERVER_TYPE_FP32;
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddInput(
+            irequest, input, datatype, input_shape, input_shape.length),
+        "setting input 0 meta-data for the request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output),
+        "requesting output 0 for the request");
+
+    // Create the data for the input tensor.
+    BytePointer input_data;
+    FloatPointer[] p0 = {null};
+    GenerateInputData(p0);
+    input_data = p0[0].getPointer(BytePointer.class);
+    long input_size = input_data.limit();
+    Pointer input_base = input_data;
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAppendInputData(
+            irequest, input, input_base, input_size, requested_memory_type,
+            0 /* memory_type_id */),
+        "assigning INPUT data");
+
+    // Perform inference...
+    {
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
-      // Create the allocator that will be used to allocate buffers for
-      // the result tensors.
-      TRITONSERVER_ResponseAllocator allocator = new TRITONSERVER_ResponseAllocator(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorNew(
-              allocator, responseAlloc, responseRelease, null /* start_fn */),
-          "creating response allocator");
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
 
-      // Inference
-      TRITONSERVER_InferenceRequest irequest = new TRITONSERVER_InferenceRequest(null);
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestNew(
-              irequest, server, model_name, -1 /* model_version */),
-          "creating inference request");
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
+
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
-          "setting ID for the request");
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
+
+      Check(
+          model_name, backend, completed_response, input_data, output,
+          datatype);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetReleaseCallback(
-              irequest, inferRequestComplete, null /* request_release_userp */),
-          "setting request release callback");
-
-      
-      // Model inputs
-      String input = "";
-      String output = "";
-      long[] input_shape = {1, 224, 224, 3};
-
-      switch (backend) {
-        case ONNX:
-          input = "import/input:0";
-          output = "import/resnet_v1_50/predictions/Softmax:0";
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
+    }
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestDelete(irequest),
+        "deleting inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorDelete(allocator),
+        "deleting response allocator");
+  }
+
+  public static void main(String[] args) throws Exception
+  {
+    String model_repository_path = null;
+    int verbose_level = 0;
+
+    // Parse commandline...
+    for (int i = 0; i < args.length; i++) {
+      switch (args[i]) {
+        case "-m": {
+          enforce_memory_type = true;
+          i++;
+          if (args[i].equals("system")) {
+            requested_memory_type = TRITONSERVER_MEMORY_CPU;
+          } else if (args[i].equals("pinned")) {
+            requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
+          } else if (args[i].equals("gpu")) {
+            requested_memory_type = TRITONSERVER_MEMORY_GPU;
+          } else {
+            Usage(
+                "-m must be used to specify one of the following types:"
+                + " <\"system\"|\"pinned\"|gpu>");
+          }
           break;
-        case TF:
-          input = "input";
-          output = "probabilities";
+        }
+        case "-r":
+          model_repository_path = args[++i];
           break;
-        case TORCH:
-          input = "INPUT__0";
-          input_shape[1] = 3;
-          input_shape[3] = 224;
-          output = "OUTPUT__0";
+        case "-v":
+          verbose_level = 1;
           break;
-        default:
-          FAIL("Unsupported model type");
+        case "-?":
+          Usage(null);
           break;
       }
+    }
 
-      int datatype = TRITONSERVER_TYPE_FP32;
-
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddInput(
-              irequest, input, datatype, input_shape, input_shape.length),
-          "setting input 0 meta-data for the request");
-
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output),
-          "requesting output 0 for the request");
-
-      // Create the data for the two input tensors. Initialize the first
-      // to unique values and the second to all ones.
-      BytePointer input_data;
-      FloatPointer[] p0 = {null};
-      GenerateInputData(p0);
-      input_data = p0[0].getPointer(BytePointer.class);
-      long input_size = input_data.limit();
-      Pointer input_base = input_data;
+    if (model_repository_path == null) {
+      Usage("-r must be used to specify model repository path");
+    }
+    if (enforce_memory_type
+        && requested_memory_type != TRITONSERVER_MEMORY_CPU) {
+      Usage("-m can only be set to \"system\" without enabling GPU");
+    }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAppendInputData(
-              irequest, input, input_base, input_size, requested_memory_type,
-              0 /* memory_type_id */),
-          "assigning INPUT data");
-
-      // Perform inference...
-      {
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-
-        Check(
-            model_name, backend, completed_response, input_data, output, datatype);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
+    // Check API version.
+    int[] api_version_major = {0}, api_version_minor = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
+        "getting Triton API version");
+    if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0])
+        || (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
+      FAIL("triton server API version mismatch");
+    }
 
+    // Create the server...
+    TRITONSERVER_ServerOptions server_options =
+        new TRITONSERVER_ServerOptions(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsNew(server_options),
+        "creating server options");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetModelRepositoryPath(
+            server_options, model_repository_path),
+        "setting model repository path");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
+        "setting verbose logging level");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetBackendDirectory(
+            server_options, "/opt/tritonserver/backends"),
+        "setting backend directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
+            server_options, "/opt/tritonserver/repoagents"),
+        "setting repository agent directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
+        "setting strict model configuration");
+    double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
+            server_options, min_compute_capability),
+        "setting minimum supported CUDA compute capability");
+
+    TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsDelete(server_options),
+        "deleting server options");
+
+    TRITONSERVER_ServerDeleter server =
+        new TRITONSERVER_ServerDeleter(server_ptr);
+
+    // Wait until the server is both live and ready.
+    int health_iters = 0;
+    while (true) {
+      boolean[] live = {false}, ready = {false};
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestDelete(irequest),
-          "deleting inference request");
-
+          TRITONSERVER_ServerIsLive(server, live),
+          "unable to get server liveness");
       FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorDelete(allocator),
-          "deleting response allocator");
-    }
-    
-    public static void
-    main(String[] args) throws Exception
-    {
-      String model_repository_path = null;
-      int verbose_level = 0;
-
-      // Parse commandline...
-      for (int i = 0; i < args.length; i++) {
-        switch (args[i]) {
-          case "-m": {
-            enforce_memory_type = true;
-            i++;
-            if (args[i].equals("system")) {
-              requested_memory_type = TRITONSERVER_MEMORY_CPU;
-            } else if (args[i].equals("pinned")) {
-              requested_memory_type = TRITONSERVER_MEMORY_CPU_PINNED;
-            } else if (args[i].equals("gpu")) {
-              requested_memory_type = TRITONSERVER_MEMORY_GPU;
-            } else {
-              Usage(
-                  "-m must be used to specify one of the following types:" +
-                  " <\"system\"|\"pinned\"|gpu>");
-            }
-            break;
-          }
-          case "-r":
-            model_repository_path = args[++i];
-            break;
-          case "-v":
-            verbose_level = 1;
-            break;
-          case "-?":
-            Usage(null);
-            break;
-        }
-      }
-
-      if (model_repository_path == null) {
-        Usage("-r must be used to specify model repository path");
-      }
-      if (enforce_memory_type && requested_memory_type != TRITONSERVER_MEMORY_CPU) {
-        Usage("-m can only be set to \"system\" without enabling GPU");
+          TRITONSERVER_ServerIsReady(server, ready),
+          "unable to get server readiness");
+      System.out.println(
+          "Server Health: live " + live[0] + ", ready " + ready[0]);
+      if (live[0] && ready[0]) {
+        break;
       }
 
-      // Check API version.
-      int[] api_version_major = {0}, api_version_minor = {0};
-      FAIL_IF_ERR(
-          TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
-          "getting Triton API version");
-      if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0]) ||
-          (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
-        FAIL("triton server API version mismatch");
+      if (++health_iters >= 10) {
+        FAIL("failed to find healthy inference server");
       }
 
-      // Create the server...
-      TRITONSERVER_ServerOptions server_options = new TRITONSERVER_ServerOptions(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsNew(server_options),
-          "creating server options");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetModelRepositoryPath(
-              server_options, model_repository_path),
-          "setting model repository path");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
-          "setting verbose logging level");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetBackendDirectory(
-              server_options, "/opt/tritonserver/backends"),
-          "setting backend directory");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
-              server_options, "/opt/tritonserver/repoagents"),
-          "setting repository agent directory");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
-          "setting strict model configuration");
-      double min_compute_capability = TRITON_MIN_COMPUTE_CAPABILITY;
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability(
-              server_options, min_compute_capability),
-          "setting minimum supported CUDA compute capability");
+      Thread.sleep(500);
+    }
 
-      TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
+    // Print status of the server.
+    {
+      TRITONSERVER_Message server_metadata_message =
+          new TRITONSERVER_Message(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
+          TRITONSERVER_ServerMetadata(server, server_metadata_message),
+          "unable to get server metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsDelete(server_options),
-          "deleting server options");
-
-      TRITONSERVER_ServerDeleter server = new TRITONSERVER_ServerDeleter(server_ptr);
-
-      // Wait until the server is both live and ready.
-      int health_iters = 0;
-      while (true) {
-        boolean[] live = {false}, ready = {false};
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsLive(server, live),
-            "unable to get server liveness");
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsReady(server, ready),
-            "unable to get server readiness");
-        System.out.println("Server Health: live " + live[0] + ", ready " + ready[0]);
-        if (live[0] && ready[0]) {
-          break;
-        }
-
-        if (++health_iters >= 10) {
-          FAIL("failed to find healthy inference server");
-        }
+          TRITONSERVER_MessageSerializeToJson(
+              server_metadata_message, buffer, byte_size),
+          "unable to serialize server metadata message");
 
-        Thread.sleep(500);
-      }
-
-      // Print status of the server.
-      {
-        TRITONSERVER_Message server_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerMetadata(server, server_metadata_message),
-            "unable to get server metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                server_metadata_message, buffer, byte_size),
-            "unable to serialize server metadata message");
-
-        System.out.println("Server Status:");
-        System.out.println(buffer.limit(byte_size.get()).getString());
-
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(server_metadata_message),
-            "deleting status metadata");
-      }
+      System.out.println("Server Status:");
+      System.out.println(buffer.limit(byte_size.get()).getString());
 
-      for(String model : MODELS) {
-        PerformInference(server, model);
-      }
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageDelete(server_metadata_message),
+          "deleting status metadata");
+    }
 
-      System.exit(0);
+    for (String model : MODELS) {
+      PerformInference(server, model);
     }
+
+    System.exit(0);
+  }
 }
diff --git a/qa/L0_java_resnet/test.sh b/qa/L0_java_resnet/test.sh
index e2f424fd7e..1ca08b4c65 100755
--- a/qa/L0_java_resnet/test.sh
+++ b/qa/L0_java_resnet/test.sh
@@ -41,6 +41,8 @@ fi
 # Models
 DATADIR=/data/inferenceserver/${REPO_VERSION}
 MODEL_REPO=`pwd`/models
+JAVACPP_BRANCH=${JAVACPP_BRANCH:="https://github.com/bytedeco/javacpp-presets.git"}
+JAVACPP_BRANCH_TAG=${JAVACPP_BRANCH_TAG:="master"}
 
 # Create local model repository
 mkdir -p ${MODEL_REPO}
@@ -53,14 +55,10 @@ done
 
 # Set up test files based on installation instructions
 # https://github.com/bytedeco/javacpp-presets/blob/master/tritonserver/README.md
-set +e
-rm -r javacpp-presets
-git clone https://github.com/bytedeco/javacpp-presets.git
-cd javacpp-presets
-mvn clean install --projects .,tritonserver
-mvn clean install -f platform --projects ../tritonserver/platform -Djavacpp.platform.host
-cd ..
 set -e
+git clone --single-branch --depth=1 -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git
+source client/src/java-api-bindings/scripts/install_dependencies_and_build.sh -b $PWD --javacpp-branch ${JAVACPP_BRANCH} --javacpp-tag ${JAVACPP_BRANCH_TAG} --keep-build-dependencies
+cd ..
 
 CLIENT_LOG="client.log"
 SAMPLES_REPO=`pwd`/javacpp-presets/tritonserver/samples/simple
diff --git a/qa/L0_java_sequence_batcher/SequenceTest.java b/qa/L0_java_sequence_batcher/SequenceTest.java
index 3fdc5d63c1..cfce3584de 100644
--- a/qa/L0_java_sequence_batcher/SequenceTest.java
+++ b/qa/L0_java_sequence_batcher/SequenceTest.java
@@ -1,4 +1,4 @@
-// Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 //
 // Redistribution and use in source and binary forms, with or without
 // modification, are permitted provided that the following conditions
@@ -24,615 +24,642 @@
 // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
+import static org.bytedeco.tritonserver.global.tritonserver.*;
+
+import com.google.gson.*;
 import java.io.*;
 import java.util.*;
 import java.util.concurrent.*;
-import com.google.gson.*;
 import org.bytedeco.javacpp.*;
 import org.bytedeco.tritonserver.tritonserver.*;
-import static org.bytedeco.tritonserver.global.tritonserver.*;
 
 public class SequenceTest {
-
-    // Boilerplate code for setting up Triton
-    static void FAIL(String MSG) {
-        System.err.println("Failure: " + MSG);
-        System.exit(1);
-    }
-
-    static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG) {
-        if (err__ != null) {
-            System.err.println("error: " + MSG + ":"
-                             + TRITONSERVER_ErrorCodeString(err__) + " - "
-                             + TRITONSERVER_ErrorMessage(err__));
-            TRITONSERVER_ErrorDelete(err__);
-            System.exit(1);
-        }
+  // Boilerplate code for setting up Triton
+  static void FAIL(String MSG)
+  {
+    System.err.println("Failure: " + MSG);
+    System.exit(1);
+  }
+
+  static void FAIL_IF_ERR(TRITONSERVER_Error err__, String MSG)
+  {
+    if (err__ != null) {
+      System.err.println(
+          "error: " + MSG + ":" + TRITONSERVER_ErrorCodeString(err__) + " - "
+          + TRITONSERVER_ErrorMessage(err__));
+      TRITONSERVER_ErrorDelete(err__);
+      System.exit(1);
     }
+  }
 
-    static int requested_memory_type = TRITONSERVER_MEMORY_CPU;
-
-    static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
-        public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p) { super(p); deallocator(new DeleteDeallocator(this)); }
-        protected static class DeleteDeallocator extends TRITONSERVER_Server implements Deallocator {
-            DeleteDeallocator(Pointer p) { super(p); }
-            @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
-        }
-    }
+  static int requested_memory_type = TRITONSERVER_MEMORY_CPU;
 
-    static void
-    Usage(String msg)
+  static class TRITONSERVER_ServerDeleter extends TRITONSERVER_Server {
+    public TRITONSERVER_ServerDeleter(TRITONSERVER_Server p)
     {
-      if (msg != null) {
-        System.err.println(msg);
-      }
-
-      System.err.println("Usage: java " + SequenceTest.class.getSimpleName() + " [options]");
-      System.err.println("\t-m [model name]");
-      System.err.println("\t-v Enable verbose logging");
-      System.err.println("\t-r [model repository absolute path]");
-
-      System.exit(1);
+      super(p);
+      deallocator(new DeleteDeallocator(this));
     }
-
-    static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, String tensor_name,
-            long byte_size, int preferred_memory_type,
-            long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
-            PointerPointer buffer_userp, IntPointer actual_memory_type,
-            LongPointer actual_memory_type_id)
-        {
-          // Initially attempt to make the actual memory type and id that we
-          // allocate be the same as preferred memory type
-          actual_memory_type.put(0, preferred_memory_type);
-          actual_memory_type_id.put(0, preferred_memory_type_id);
-
-          // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
-          // need to do any other book-keeping.
-          if (byte_size == 0) {
-            buffer.put(0, null);
-            buffer_userp.put(0, null);
-            System.out.println("allocated " + byte_size + " bytes for result tensor " + tensor_name);
-          } else {
-            Pointer allocated_ptr = new Pointer();
-            actual_memory_type.put(0, requested_memory_type);
-
-            actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
-            allocated_ptr = Pointer.malloc(byte_size);
-
-            // Pass the tensor name with buffer_userp so we can show it when
-            // releasing the buffer.
-            if (!allocated_ptr.isNull()) {
-              buffer.put(0, allocated_ptr);
-              buffer_userp.put(0, new BytePointer(tensor_name));
-              System.out.println("allocated " + byte_size + " bytes in "
-                               + TRITONSERVER_MemoryTypeString(actual_memory_type.get())
-                               + " for result tensor " + tensor_name);
-            }
-          }
-
-          return null;  // Success
-        }
+    protected static class DeleteDeallocator
+        extends TRITONSERVER_Server implements Deallocator {
+      DeleteDeallocator(Pointer p) { super(p); }
+      @Override public void deallocate() { TRITONSERVER_ServerDelete(this); }
     }
+  }
 
-    static class ResponseRelease extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
-        @Override public TRITONSERVER_Error call (
-            TRITONSERVER_ResponseAllocator allocator, Pointer buffer, Pointer buffer_userp,
-            long byte_size, int memory_type, long memory_type_id)
-        {
-          BytePointer name = null;
-          if (buffer_userp != null) {
-            name = new BytePointer(buffer_userp);
-          } else {
-            name = new BytePointer("");
-          }
-
-          System.out.println("Releasing buffer " + buffer + " of size " + byte_size
-                           + " in " + TRITONSERVER_MemoryTypeString(memory_type)
-                           + " for result '" + name.getString() + "'");
-          Pointer.free(buffer);
-          name.deallocate();
-
-          return null;  // Success
-        }
+  static void Usage(String msg)
+  {
+    if (msg != null) {
+      System.err.println(msg);
     }
 
-    static class InferRequestComplete extends TRITONSERVER_InferenceRequestReleaseFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
-        {
-          // We reuse the request so we don't delete it here.
+    System.err.println(
+        "Usage: java " + SequenceTest.class.getSimpleName() + " [options]");
+    System.err.println("\t-m [model name]");
+    System.err.println("\t-v Enable verbose logging");
+    System.err.println("\t-r [model repository absolute path]");
+
+    System.exit(1);
+  }
+
+  static class ResponseAlloc extends TRITONSERVER_ResponseAllocatorAllocFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, String tensor_name,
+        long byte_size, int preferred_memory_type,
+        long preferred_memory_type_id, Pointer userp, PointerPointer buffer,
+        PointerPointer buffer_userp, IntPointer actual_memory_type,
+        LongPointer actual_memory_type_id)
+    {
+      // Initially attempt to make the actual memory type and id that we
+      // allocate be the same as preferred memory type
+      actual_memory_type.put(0, preferred_memory_type);
+      actual_memory_type_id.put(0, preferred_memory_type_id);
+
+      // If 'byte_size' is zero just return 'buffer' == nullptr, we don't
+      // need to do any other book-keeping.
+      if (byte_size == 0) {
+        buffer.put(0, null);
+        buffer_userp.put(0, null);
+        System.out.println(
+            "allocated " + byte_size + " bytes for result tensor "
+            + tensor_name);
+      } else {
+        Pointer allocated_ptr = new Pointer();
+        actual_memory_type.put(0, requested_memory_type);
+
+        actual_memory_type.put(0, TRITONSERVER_MEMORY_CPU);
+        allocated_ptr = Pointer.malloc(byte_size);
+
+        // Pass the tensor name with buffer_userp so we can show it when
+        // releasing the buffer.
+        if (!allocated_ptr.isNull()) {
+          buffer.put(0, allocated_ptr);
+          buffer_userp.put(0, new BytePointer(tensor_name));
+          System.out.println(
+              "allocated " + byte_size + " bytes in "
+              + TRITONSERVER_MemoryTypeString(actual_memory_type.get())
+              + " for result tensor " + tensor_name);
         }
-    }
+      }
 
-    static class InferResponseComplete extends TRITONSERVER_InferenceResponseCompleteFn_t {
-        @Override public void call (
-            TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
-        {
-          if (response != null) {
-            // Send 'response' to the future.
-            futures.get(userp).complete(response);
-          }
-        }
+      return null; // Success
     }
-
-    static ConcurrentHashMap<Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures = new ConcurrentHashMap<>();
-    static ResponseAlloc responseAlloc = new ResponseAlloc();
-    static ResponseRelease responseRelease = new ResponseRelease();
-    static InferRequestComplete inferRequestComplete = new InferRequestComplete();
-    static InferResponseComplete inferResponseComplete = new InferResponseComplete();
-
-    static TRITONSERVER_Error
-    ParseModelMetadata(
-        JsonObject model_metadata,
-        boolean[] is_torch_model)
+  }
+
+  static class ResponseRelease
+      extends TRITONSERVER_ResponseAllocatorReleaseFn_t {
+    @Override
+    public TRITONSERVER_Error call(
+        TRITONSERVER_ResponseAllocator allocator, Pointer buffer,
+        Pointer buffer_userp, long byte_size, int memory_type,
+        long memory_type_id)
     {
-      String seen_data_type = null;
-      for (JsonElement input_element : model_metadata.get("inputs").getAsJsonArray()) {
-        JsonObject input = input_element.getAsJsonObject();
-        if (!input.get("datatype").getAsString().equals("INT32")) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_UNSUPPORTED,
-              "sequence qa example only supports model with data type INT32");
-        }
-        if (seen_data_type == null) {
-          seen_data_type = input.get("datatype").getAsString();
-        } else if (!seen_data_type.equals(input.get("datatype").getAsString())) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_INVALID_ARG,
-              "the inputs and outputs of sequence model must have the data type");
-        }
-      }
-      for (JsonElement output_element : model_metadata.get("outputs").getAsJsonArray()) {
-        JsonObject output = output_element.getAsJsonObject();
-        if (!output.get("datatype").getAsString().equals("INT32")) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_UNSUPPORTED,
-              "sequence qa example only supports model with data type INT32");
-        } else if (!seen_data_type.equals(output.get("datatype").getAsString())) {
-          return TRITONSERVER_ErrorNew(
-              TRITONSERVER_ERROR_INVALID_ARG,
-              "the inputs and outputs of sequence' model must have the data type");
-        }
+      BytePointer name = null;
+      if (buffer_userp != null) {
+        name = new BytePointer(buffer_userp);
+      } else {
+        name = new BytePointer("");
       }
 
-      is_torch_model[0] =
-          model_metadata.get("platform").getAsString().equals("pytorch_libtorch");
-      return null;
+      System.out.println(
+          "Releasing buffer " + buffer + " of size " + byte_size + " in "
+          + TRITONSERVER_MemoryTypeString(memory_type) + " for result '"
+          + name.getString() + "'");
+      Pointer.free(buffer);
+      name.deallocate();
+
+      return null; // Success
     }
+  }
 
-    // Custom function to set metadata required for sequence batcher
-    static void
-    SetSequenceMetadata(TRITONSERVER_InferenceRequest irequest, long correlation_id, boolean sequence_start, boolean sequence_end)
+  static class InferRequestComplete
+      extends TRITONSERVER_InferenceRequestReleaseFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceRequest request, int flags, Pointer userp)
     {
+      // We reuse the request so we don't delete it here.
+    }
+  }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetCorrelationId(
-              irequest, correlation_id), "Unable to set correlation ID");
-      int flags = 0;
-      if(sequence_start) {
-        flags += TRITONSERVER_REQUEST_FLAG_SEQUENCE_START;
+  static class InferResponseComplete
+      extends TRITONSERVER_InferenceResponseCompleteFn_t {
+    @Override
+    public void call(
+        TRITONSERVER_InferenceResponse response, int flags, Pointer userp)
+    {
+      if (response != null) {
+        // Send 'response' to the future.
+        futures.get(userp).complete(response);
       }
-      if(sequence_end) {
-        flags += TRITONSERVER_REQUEST_FLAG_SEQUENCE_END;
+    }
+  }
+
+  static ConcurrentHashMap<
+      Pointer, CompletableFuture<TRITONSERVER_InferenceResponse>> futures =
+      new ConcurrentHashMap<>();
+  static ResponseAlloc responseAlloc = new ResponseAlloc();
+  static ResponseRelease responseRelease = new ResponseRelease();
+  static InferRequestComplete inferRequestComplete = new InferRequestComplete();
+  static InferResponseComplete inferResponseComplete =
+      new InferResponseComplete();
+
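+  // Verify that the model's inputs and outputs all use the INT32 data type
+  // and record whether the model is a PyTorch (libtorch) model.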
+  static TRITONSERVER_Error ParseModelMetadata(
+      JsonObject model_metadata, boolean[] is_torch_model)
+  {
+    String seen_data_type = null;
+    for (JsonElement input_element :
+         model_metadata.get("inputs").getAsJsonArray()) {
+      JsonObject input = input_element.getAsJsonObject();
+      if (!input.get("datatype").getAsString().equals("INT32")) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_UNSUPPORTED,
+            "sequence qa example only supports model with data type INT32");
+      }
+      if (seen_data_type == null) {
+        seen_data_type = input.get("datatype").getAsString();
+      } else if (!seen_data_type.equals(input.get("datatype").getAsString())) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_INVALID_ARG,
+            "the inputs and outputs of sequence model must have the data type");
+      }
+    }
+    for (JsonElement output_element :
+         model_metadata.get("outputs").getAsJsonArray()) {
+      JsonObject output = output_element.getAsJsonObject();
+      if (!output.get("datatype").getAsString().equals("INT32")) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_UNSUPPORTED,
+            "sequence qa example only supports model with data type INT32");
+      } else if (!seen_data_type.equals(output.get("datatype").getAsString())) {
+        return TRITONSERVER_ErrorNew(
+            TRITONSERVER_ERROR_INVALID_ARG,
+            "the inputs and outputs of sequence' model must have the data type");
       }
-      FAIL_IF_ERR(
-        TRITONSERVER_InferenceRequestSetFlags(
-            irequest, flags), "Unable to set flags");
-
     }
 
-    // Custom function for adjusting sequence batcher
-    // expected results for backends that do not implement
-    // full accumulator
-    static int
-    GetExpectedResult(String model_name, int expected_result, int value, String flag){
-      if((!model_name.contains("nobatch") && !model_name.contains("custom")) ||
-          model_name.contains("graphdef") || model_name.contains("plan") ||
-          model_name.contains("onnx") || model_name.contains("libtorch")){
-            expected_result = value;
-            if(flag != null && flag.contains("start")){
-              expected_result++;
-            }
-        }
-        return expected_result;
+    is_torch_model[0] =
+        model_metadata.get("platform").getAsString().equals("pytorch_libtorch");
+    return null;
+  }
+
+  // Custom function to set metadata required for sequence batcher
+  static void SetSequenceMetadata(
+      TRITONSERVER_InferenceRequest irequest, long correlation_id,
+      boolean sequence_start, boolean sequence_end)
+  {
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetCorrelationId(irequest, correlation_id),
+        "Unable to set correlation ID");
+    int flags = 0;
+    if (sequence_start) {
+      flags += TRITONSERVER_REQUEST_FLAG_SEQUENCE_START;
+    }
+    if (sequence_end) {
+      flags += TRITONSERVER_REQUEST_FLAG_SEQUENCE_END;
+    }
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetFlags(irequest, flags),
+        "Unable to set flags");
+  }
+
+  // Custom function for adjusting sequence batcher
+  // expected results for backends that do not implement
+  // full accumulator
+  static int GetExpectedResult(
+      String model_name, int expected_result, int value, String flag)
+  {
+    if ((!model_name.contains("nobatch") && !model_name.contains("custom"))
+        || model_name.contains("graphdef") || model_name.contains("plan")
+        || model_name.contains("onnx") || model_name.contains("libtorch")) {
+      expected_result = value;
+      if (flag != null && flag.contains("start")) {
+        expected_result++;
+      }
+    }
+    return expected_result;
+  }
+
+  // Standard function for checking response parameters,
+  // plus customized check that final sequence result
+  // "out" matches expected result
+  static void Check(
+      String model_name, TRITONSERVER_InferenceResponse response,
+      int input_value, String output0, long expected_byte_size,
+      int expected_datatype, boolean sequence_end, int expected_result)
+  {
+    HashMap<String, BytePointer> output_data = new HashMap<>();
+
+    int[] output_count = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceResponseOutputCount(response, output_count),
+        "getting number of response outputs");
+    if (output_count[0] != 1) {
+      FAIL("expecting 1 response outputs, got " + output_count[0]);
     }
 
-    // Standard function for checking response parameters,
-    // plus customized check that final sequence result
-    // "out" matches expected result
-    static void
-    Check(
-        String model_name,
-        TRITONSERVER_InferenceResponse response,
-        int input_value, String output0,
-        long expected_byte_size, int expected_datatype,
-        boolean sequence_end, int expected_result)
-    {
-      HashMap<String, BytePointer> output_data = new HashMap<>();
+    for (int idx = 0; idx < output_count[0]; ++idx) {
+      BytePointer cname = new BytePointer((Pointer) null);
+      IntPointer datatype = new IntPointer(1);
+      LongPointer shape = new LongPointer((Pointer) null);
+      LongPointer dim_count = new LongPointer(1);
+      Pointer base = new Pointer();
+      SizeTPointer byte_size = new SizeTPointer(1);
+      IntPointer memory_type = new IntPointer(1);
+      LongPointer memory_type_id = new LongPointer(1);
+      Pointer userp = new Pointer();
 
-      int[] output_count = {0};
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceResponseOutputCount(response, output_count),
-          "getting number of response outputs");
-      if (output_count[0] != 1) {
-        FAIL("expecting 1 response outputs, got " + output_count[0]);
+          TRITONSERVER_InferenceResponseOutput(
+              response, idx, cname, datatype, shape, dim_count, base, byte_size,
+              memory_type, memory_type_id, userp),
+          "getting output info");
+
+      if (cname.isNull()) {
+        FAIL("unable to get output name");
       }
 
-      for (int idx = 0; idx < output_count[0]; ++idx) {
-        BytePointer cname = new BytePointer((Pointer)null);
-        IntPointer datatype = new IntPointer(1);
-        LongPointer shape = new LongPointer((Pointer)null);
-        LongPointer dim_count = new LongPointer(1);
-        Pointer base = new Pointer();
-        SizeTPointer byte_size = new SizeTPointer(1);
-        IntPointer memory_type = new IntPointer(1);
-        LongPointer memory_type_id = new LongPointer(1);
-        Pointer userp = new Pointer();
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseOutput(
-                response, idx, cname, datatype, shape, dim_count, base,
-                byte_size, memory_type, memory_type_id, userp),
-            "getting output info");
-
-        if (cname.isNull()) {
-          FAIL("unable to get output name");
-        }
+      String name = cname.getString();
+      if (!name.equals(output0)) {
+        FAIL("unexpected output '" + name + "'");
+      }
 
-        String name = cname.getString();
-        if (!name.equals(output0)) {
-          FAIL("unexpected output '" + name + "'");
-        }
+      if ((dim_count.get() != 1) || (shape.get(0) != 1)) {
+        FAIL("unexpected shape for '" + name + "'");
+      }
 
-        if ((dim_count.get() != 1) || (shape.get(0) != 1)) {
-          FAIL("unexpected shape for '" + name + "'");
-        }
+      if (datatype.get() != expected_datatype) {
+        FAIL(
+            "unexpected datatype '"
+            + TRITONSERVER_DataTypeString(datatype.get()) + "' for '" + name
+            + "'");
+      }
 
-        if (datatype.get() != expected_datatype) {
-          FAIL(
-              "unexpected datatype '" +
-              TRITONSERVER_DataTypeString(datatype.get()) + "' for '" +
-              name + "'");
-        }
+      if (byte_size.get() != expected_byte_size) {
+        FAIL(
+            "unexpected byte-size, expected " + expected_byte_size + ", got "
+            + byte_size.get() + " for " + name);
+      }
 
-        if (byte_size.get() != expected_byte_size) {
-          FAIL(
-              "unexpected byte-size, expected " +
-              expected_byte_size + ", got " +
-              byte_size.get() + " for " + name);
-        }
+      if (memory_type.get() != requested_memory_type) {
+        FAIL(
+            "unexpected memory type, expected to be allocated in "
+            + TRITONSERVER_MemoryTypeString(requested_memory_type) + ", got "
+            + TRITONSERVER_MemoryTypeString(memory_type.get()) + ", id "
+            + memory_type_id.get() + " for " + name);
+      }
 
-        if (memory_type.get() != requested_memory_type) {
-          FAIL(
-              "unexpected memory type, expected to be allocated in " +
-              TRITONSERVER_MemoryTypeString(requested_memory_type) +
-              ", got " + TRITONSERVER_MemoryTypeString(memory_type.get()) +
-              ", id " + memory_type_id.get() + " for " + name);
-        }
+      // We make a copy of the data here... which we could avoid for
+      // performance reasons but ok for this sequence example.
+      BytePointer odata = new BytePointer(byte_size.get());
+      output_data.put(name, odata);
+      System.out.println(name + " is stored in system memory");
+      odata.put(base.limit(byte_size.get()));
+    }
 
-        // We make a copy of the data here... which we could avoid for
-        // performance reasons but ok for this sequence example.
-        BytePointer odata = new BytePointer(byte_size.get());
-        output_data.put(name, odata);
-        System.out.println(name + " is stored in system memory");
-        odata.put(base.limit(byte_size.get()));
+    int out = new IntPointer(output_data.get(output0)).get(0);
+    System.out.println("Value: " + out);
+    if (sequence_end) {
+      expected_result =
+          GetExpectedResult(model_name, expected_result, input_value, "end");
+      if (out != expected_result) {
+        FAIL("Expected result: " + expected_result + ", got " + out);
+      } else {
+        System.out.println(model_name + " test PASSED");
       }
-
-      int out = new IntPointer(output_data.get(output0)).get(0);
-      System.out.println("Value: " + out);
-      if(sequence_end){
-        expected_result = GetExpectedResult(model_name, expected_result,
-            input_value, "end");
-        if(out != expected_result){
-          FAIL("Expected result: " + expected_result + ", got " + out);
-        } else {
-          System.out.println(model_name + " test PASSED");
-        }
+    }
+  }
+
+  // Boilerplate main function to run inference
+  // for provided model, custom setting of
+  // sequence metadata
+  public static void main(String[] args) throws Exception
+  {
+    String model_repository_path = null;
+    String model_name = null;
+    int verbose_level = 0;
+
+    // Parse commandline...
+    for (int i = 0; i < args.length; i++) {
+      switch (args[i]) {
+        case "-m":
+          model_name = args[++i];
+          break;
+        case "-r":
+          model_repository_path = args[++i];
+          break;
+        case "-v":
+          verbose_level = 1;
+          break;
+        case "-?":
+          Usage(null);
+          break;
       }
     }
 
-    // Boilerplate main function to run inference
-    // for provided model, custom setting of
-    // sequence metadata
-    public static void
-    main(String[] args) throws Exception
-    {
-      String model_repository_path = null;
-      String model_name = null;
-      int verbose_level = 0;
-
-      // Parse commandline...
-      for (int i = 0; i < args.length; i++) {
-        switch (args[i]) {
-          case "-m":
-            model_name = args[++i];
-            break;
-          case "-r":
-            model_repository_path = args[++i];
-            break;
-          case "-v":
-            verbose_level = 1;
-            break;
-          case "-?":
-            Usage(null);
-            break;
-        }
-      }
+    if (model_name == null) {
+      Usage("-m must be used to specify model name");
+    }
+    if (model_repository_path == null) {
+      Usage("-r must be used to specify model repository path");
+    }
 
-      if(model_name == null) {
-        Usage("-m must be used to specify model name");
-      }
-      if (model_repository_path == null) {
-        Usage("-r must be used to specify model repository path");
-      }
+    // Check API version.
+    int[] api_version_major = {0}, api_version_minor = {0};
+    FAIL_IF_ERR(
+        TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
+        "getting Triton API version");
+    if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0])
+        || (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
+      FAIL("triton server API version mismatch");
+    }
 
-      // Check API version.
-      int[] api_version_major = {0}, api_version_minor = {0};
+    // Create the server...
+    TRITONSERVER_ServerOptions server_options =
+        new TRITONSERVER_ServerOptions(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsNew(server_options),
+        "creating server options");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetModelRepositoryPath(
+            server_options, model_repository_path),
+        "setting model repository path");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
+        "setting verbose logging level");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetBackendDirectory(
+            server_options, "/opt/tritonserver/backends"),
+        "setting backend directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
+            server_options, "/opt/tritonserver/repoagents"),
+        "setting repository agent directory");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
+        "setting strict model configuration");
+
+    TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
+    FAIL_IF_ERR(
+        TRITONSERVER_ServerOptionsDelete(server_options),
+        "deleting server options");
+
+    TRITONSERVER_ServerDeleter server =
+        new TRITONSERVER_ServerDeleter(server_ptr);
+
+    // Wait until the server is both live and ready.
+    int health_iters = 0;
+    while (true) {
+      boolean[] live = {false}, ready = {false};
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerIsLive(server, live),
+          "unable to get server liveness");
       FAIL_IF_ERR(
-          TRITONSERVER_ApiVersion(api_version_major, api_version_minor),
-          "getting Triton API version");
-      if ((TRITONSERVER_API_VERSION_MAJOR != api_version_major[0]) ||
-          (TRITONSERVER_API_VERSION_MINOR > api_version_minor[0])) {
-        FAIL("triton server API version mismatch");
+          TRITONSERVER_ServerIsReady(server, ready),
+          "unable to get server readiness");
+      System.out.println(
+          "Server Health: live " + live[0] + ", ready " + ready[0]);
+      if (live[0] && ready[0]) {
+        break;
       }
 
-      // Create the server...
-      TRITONSERVER_ServerOptions server_options = new TRITONSERVER_ServerOptions(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsNew(server_options),
-          "creating server options");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetModelRepositoryPath(
-              server_options, model_repository_path),
-          "setting model repository path");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetLogVerbose(server_options, verbose_level),
-          "setting verbose logging level");
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetBackendDirectory(
-              server_options, "/opt/tritonserver/backends"),
-          "setting backend directory");
+      if (++health_iters >= 10) {
+        FAIL("failed to find healthy inference server");
+      }
+
+      Thread.sleep(500);
+    }
+
+    // Print status of the server.
+    {
+      TRITONSERVER_Message server_metadata_message =
+          new TRITONSERVER_Message(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetRepoAgentDirectory(
-              server_options, "/opt/tritonserver/repoagents"),
-          "setting repository agent directory");
+          TRITONSERVER_ServerMetadata(server, server_metadata_message),
+          "unable to get server metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsSetStrictModelConfig(server_options, true),
-          "setting strict model configuration");
+          TRITONSERVER_MessageSerializeToJson(
+              server_metadata_message, buffer, byte_size),
+          "unable to serialize server metadata message");
+
+      System.out.println("Server Status:");
+      System.out.println(buffer.limit(byte_size.get()).getString());
 
-      TRITONSERVER_Server server_ptr = new TRITONSERVER_Server(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_ServerNew(server_ptr, server_options), "creating server");
       FAIL_IF_ERR(
-          TRITONSERVER_ServerOptionsDelete(server_options),
-          "deleting server options");
-
-      TRITONSERVER_ServerDeleter server = new TRITONSERVER_ServerDeleter(server_ptr);
-
-      // Wait until the server is both live and ready.
-      int health_iters = 0;
-      while (true) {
-        boolean[] live = {false}, ready = {false};
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsLive(server, live),
-            "unable to get server liveness");
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerIsReady(server, ready),
-            "unable to get server readiness");
-        System.out.println("Server Health: live " + live[0] + ", ready " + ready[0]);
-        if (live[0] && ready[0]) {
-          break;
-        }
+          TRITONSERVER_MessageDelete(server_metadata_message),
+          "deleting status metadata");
+    }
 
+    // Wait for the model to become available.
+    boolean[] is_torch_model = {false};
+    boolean[] is_ready = {false};
+    health_iters = 0;
+    while (!is_ready[0]) {
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelIsReady(server, model_name, 1, is_ready),
+          "unable to get model readiness");
+      if (!is_ready[0]) {
         if (++health_iters >= 10) {
-          FAIL("failed to find healthy inference server");
+          FAIL("model failed to be ready in 10 iterations");
         }
-
         Thread.sleep(500);
+        continue;
       }
 
-      // Print status of the server.
-      {
-        TRITONSERVER_Message server_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerMetadata(server, server_metadata_message),
-            "unable to get server metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                server_metadata_message, buffer, byte_size),
-            "unable to serialize server metadata message");
-
-        System.out.println("Server Status:");
-        System.out.println(buffer.limit(byte_size.get()).getString());
-
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(server_metadata_message),
-            "deleting status metadata");
+      TRITONSERVER_Message model_metadata_message =
+          new TRITONSERVER_Message(null);
+      FAIL_IF_ERR(
+          TRITONSERVER_ServerModelMetadata(
+              server, model_name, 1, model_metadata_message),
+          "unable to get model metadata message");
+      BytePointer buffer = new BytePointer((Pointer) null);
+      SizeTPointer byte_size = new SizeTPointer(1);
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageSerializeToJson(
+              model_metadata_message, buffer, byte_size),
+          "unable to serialize model status protobuf");
+
+      JsonParser parser = new JsonParser();
+      JsonObject model_metadata = null;
+      try {
+        model_metadata = parser.parse(buffer.limit(byte_size.get()).getString())
+                             .getAsJsonObject();
+      }
+      catch (Exception e) {
+        FAIL("error: failed to parse model metadata from JSON: " + e);
       }
 
-      // Wait for the model to become available.
-      boolean[] is_torch_model = {false};
-      boolean[] is_ready = {false};
-      health_iters = 0;
-      while (!is_ready[0]) {
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelIsReady(
-                server, model_name, 1, is_ready),
-            "unable to get model readiness");
-        if (!is_ready[0]) {
-          if (++health_iters >= 10) {
-            FAIL("model failed to be ready in 10 iterations");
-          }
-          Thread.sleep(500);
-          continue;
-        }
-
-        TRITONSERVER_Message model_metadata_message = new TRITONSERVER_Message(null);
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerModelMetadata(
-                server, model_name, 1, model_metadata_message),
-            "unable to get model metadata message");
-        BytePointer buffer = new BytePointer((Pointer)null);
-        SizeTPointer byte_size = new SizeTPointer(1);
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageSerializeToJson(
-                model_metadata_message, buffer, byte_size),
-            "unable to serialize model status protobuf");
-
-        JsonParser parser = new JsonParser();
-        JsonObject model_metadata = null;
-        try {
-          model_metadata = parser.parse(buffer.limit(byte_size.get()).getString()).getAsJsonObject();
-        } catch (Exception e) {
-          FAIL("error: failed to parse model metadata from JSON: " + e);
-        }
-
-        FAIL_IF_ERR(
-            TRITONSERVER_MessageDelete(model_metadata_message),
-            "deleting status protobuf");
+      FAIL_IF_ERR(
+          TRITONSERVER_MessageDelete(model_metadata_message),
+          "deleting status protobuf");
 
-        if (!model_metadata.get("name").getAsString().equals(model_name)) {
-          FAIL("unable to find metadata for model");
-        }
+      if (!model_metadata.get("name").getAsString().equals(model_name)) {
+        FAIL("unable to find metadata for model");
+      }
 
-        boolean found_version = false;
-        if (model_metadata.has("versions")) {
-          for (JsonElement version : model_metadata.get("versions").getAsJsonArray()) {
-            if (version.getAsString().equals("1")) {
-              found_version = true;
-              break;
-            }
+      boolean found_version = false;
+      if (model_metadata.has("versions")) {
+        for (JsonElement version :
+             model_metadata.get("versions").getAsJsonArray()) {
+          if (version.getAsString().equals("1")) {
+            found_version = true;
+            break;
           }
         }
-        if (!found_version) {
-          FAIL("unable to find version 1 status for model");
-        }
-
-        FAIL_IF_ERR(
-            ParseModelMetadata(model_metadata, is_torch_model),
-            "parsing model metadata");
+      }
+      if (!found_version) {
+        FAIL("unable to find version 1 status for model");
       }
 
-      // Create the allocator that will be used to allocate buffers for
-      // the result tensors.
-      TRITONSERVER_ResponseAllocator allocator = new TRITONSERVER_ResponseAllocator(null);
       FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorNew(
-              allocator, responseAlloc, responseRelease, null /* start_fn */),
-          "creating response allocator");
+          ParseModelMetadata(model_metadata, is_torch_model),
+          "parsing model metadata");
+    }
 
-      // Inference
-      TRITONSERVER_InferenceRequest irequest = new TRITONSERVER_InferenceRequest(null);
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestNew(
-              irequest, server, model_name, -1 /* model_version */),
-          "creating inference request");
+    // Create the allocator that will be used to allocate buffers for
+    // the result tensors.
+    TRITONSERVER_ResponseAllocator allocator =
+        new TRITONSERVER_ResponseAllocator(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorNew(
+            allocator, responseAlloc, responseRelease, null /* start_fn */),
+        "creating response allocator");
+
+    // Inference
+    TRITONSERVER_InferenceRequest irequest =
+        new TRITONSERVER_InferenceRequest(null);
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestNew(
+            irequest, server, model_name, -1 /* model_version */),
+        "creating inference request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
+        "setting ID for the request");
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestSetReleaseCallback(
+            irequest, inferRequestComplete, null /* request_release_userp */),
+        "setting request release callback");
+
+    // Inputs
+    String input0 = is_torch_model[0] ? "INPUT__0" : "INPUT";
+
+    long[] input0_shape = {1};
+
+    int datatype = TRITONSERVER_TYPE_INT32;
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddInput(
+            irequest, input0, datatype, input0_shape, input0_shape.length),
+        "setting input 0 meta-data for the request");
+
+    String output0 = is_torch_model[0] ? "OUTPUT__0" : "OUTPUT";
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output0),
+        "requesting output 0 for the request");
+
+    // Non-zero ID for the sequence requests
+    long correlation_id = 5;
+    // Number of requests in the sequence
+    int num_requests = 9;
+    // expected_result is 1 + 2 + 3 + ... + num_requests
+    int expected_result = num_requests * (1 + num_requests) / 2;
+    boolean sequence_start = true;
+    boolean sequence_end = false;
+
+    // Create the initial data for the input tensor.
+    IntPointer[] p0 = {new IntPointer(1)};
+    BytePointer input0_data = p0[0].getPointer(BytePointer.class);
+    long input0_size = input0_data.limit();
+
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestAppendInputData(
+            irequest, input0, input0_data, input0_size, requested_memory_type,
+            0 /* memory_type_id */),
+        "assigning INPUT0 data");
+
+    for (int i = 0; i < num_requests; i++) {
+      // Update input value
+      int input = i + 1;
+      p0[0].put(0, input);
+
+      // Set sequence metadata
+      if (i == 1) {
+        sequence_start = false;
+      }
+      if (i == num_requests - 1) {
+        sequence_end = true;
+      }
+      SetSequenceMetadata(
+          irequest, correlation_id, sequence_start, sequence_end);
 
-      FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetId(irequest, "my_request_id"),
-          "setting ID for the request");
+      // Perform inference...
+      CompletableFuture<TRITONSERVER_InferenceResponse> completed =
+          new CompletableFuture<>();
+      futures.put(irequest, completed);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestSetReleaseCallback(
-              irequest, inferRequestComplete, null /* request_release_userp */),
-          "setting request release callback");
-
-      // Inputs
-      String input0 = is_torch_model[0] ? "INPUT__0" : "INPUT";
-
-      long[] input0_shape = {1};
-
-      int datatype = TRITONSERVER_TYPE_INT32;
+          TRITONSERVER_InferenceRequestSetResponseCallback(
+              irequest, allocator, null /* response_allocator_userp */,
+              inferResponseComplete, irequest),
+          "setting response callback");
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddInput(
-              irequest, input0, datatype, input0_shape, input0_shape.length),
-          "setting input 0 meta-data for the request");
+          TRITONSERVER_ServerInferAsync(server, irequest, null /* trace */),
+          "running inference");
 
-      String output0 = is_torch_model[0] ? "OUTPUT__0" : "OUTPUT";
+      // Wait for the inference to complete.
+      TRITONSERVER_InferenceResponse completed_response = completed.get();
+      futures.remove(irequest);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestAddRequestedOutput(irequest, output0),
-          "requesting output 0 for the request");
-
-      // Non-zero ID for the sequence requests
-      long correlation_id = 5;
-      // Number of requests in the sequence
-      int num_requests = 9;
-      // Expected_result is  1+2+3+...+num_requests
-      int expected_result = num_requests * (1 + num_requests) / 2;
-      boolean sequence_start = true;
-      boolean sequence_end = false;
-
-      // Create the initial data for the input tensor.
-      IntPointer[] p0 = {new IntPointer(1)};
-      BytePointer input0_data = p0[0].getPointer(BytePointer.class);
-      long input0_size = input0_data.limit();
+          TRITONSERVER_InferenceResponseError(completed_response),
+          "response status");
 
-      FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestAppendInputData(
-                irequest, input0, input0_data, input0_size, requested_memory_type,
-                0 /* memory_type_id */),
-            "assigning INPUT0 data");
-
-      for(int i = 0; i < num_requests; i++) {
-        // Update input value
-        int input = i + 1;
-        p0[0].put(0, input);
-
-        // Set sequence metadata
-        if(i == 1) {
-          sequence_start = false;
-        }
-        if(i == num_requests - 1) {
-          sequence_end = true;
-        }
-        SetSequenceMetadata(irequest, correlation_id, sequence_start, sequence_end);
-        
-        // Perform inference...
-        CompletableFuture<TRITONSERVER_InferenceResponse> completed = new CompletableFuture<>();
-        futures.put(irequest, completed);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceRequestSetResponseCallback(
-                irequest, allocator, null /* response_allocator_userp */,
-                inferResponseComplete, irequest),
-            "setting response callback");
-
-        FAIL_IF_ERR(
-            TRITONSERVER_ServerInferAsync(
-                server, irequest, null /* trace */),
-            "running inference");
-
-        // Wait for the inference to complete.
-        TRITONSERVER_InferenceResponse completed_response = completed.get();
-        futures.remove(irequest);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseError(completed_response),
-            "response status");
-
-        Check(
-            model_name, completed_response, input, output0, input0_size,
-            datatype, sequence_end, expected_result);
-
-        FAIL_IF_ERR(
-            TRITONSERVER_InferenceResponseDelete(completed_response),
-            "deleting inference response");
-      }
+      Check(
+          model_name, completed_response, input, output0, input0_size, datatype,
+          sequence_end, expected_result);
 
       FAIL_IF_ERR(
-          TRITONSERVER_InferenceRequestDelete(irequest),
-          "deleting inference request");
+          TRITONSERVER_InferenceResponseDelete(completed_response),
+          "deleting inference response");
+    }
 
-      FAIL_IF_ERR(
-          TRITONSERVER_ResponseAllocatorDelete(allocator),
-          "deleting response allocator");
+    FAIL_IF_ERR(
+        TRITONSERVER_InferenceRequestDelete(irequest),
+        "deleting inference request");
 
-      System.exit(0);
-    }
+    FAIL_IF_ERR(
+        TRITONSERVER_ResponseAllocatorDelete(allocator),
+        "deleting response allocator");
+
+    System.exit(0);
+  }
 }
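
The Java sequence-batcher sample above drives one sequence of nine requests sharing correlation ID 5: the first request sets the sequence-start flag, the last sets sequence-end, and the accumulating model should return 1 + 2 + ... + 9 = num_requests * (num_requests + 1) / 2 = 45 on the final response. For reference, a minimal sketch of the same request pattern through the Python gRPC client is shown below; the model name, tensor names, and port are placeholders (assumptions, not taken from the test itself) and would have to match whatever sequence model is actually loaded.

```python
# Hypothetical sketch of the sequence pattern above via tritonclient.grpc.
# "simple_sequence", "INPUT", "OUTPUT", and the port are placeholder assumptions.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
model_name = "simple_sequence"  # placeholder for an accumulating sequence model
correlation_id = 5
num_requests = 9
expected_result = num_requests * (num_requests + 1) // 2  # 45

for i in range(num_requests):
    value = np.array([i + 1], dtype=np.int32)
    inp = grpcclient.InferInput("INPUT", [1], "INT32")
    inp.set_data_from_numpy(value)
    result = client.infer(
        model_name,
        [inp],
        sequence_id=correlation_id,
        sequence_start=(i == 0),
        sequence_end=(i == num_requests - 1),
    )
    output = result.as_numpy("OUTPUT")

# On the sequence-end response the accumulated value should match the closed form.
assert output[0] == expected_result
```
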
diff --git a/qa/L0_java_sequence_batcher/test.sh b/qa/L0_java_sequence_batcher/test.sh
index 1fe3a97fb2..2f988322d9 100755
--- a/qa/L0_java_sequence_batcher/test.sh
+++ b/qa/L0_java_sequence_batcher/test.sh
@@ -40,17 +40,15 @@ fi
 
 # Models
 DATADIR=/data/inferenceserver/${REPO_VERSION}
+JAVACPP_BRANCH=${JAVACPP_BRANCH:="https://github.com/bytedeco/javacpp-presets.git"}
+JAVACPP_BRANCH_TAG=${JAVACPP_BRANCH_TAG:="master"}
 
 # Set up test files based on installation instructions
 # https://github.com/bytedeco/javacpp-presets/blob/master/tritonserver/README.md
-set +e
-rm -r javacpp-presets
-git clone https://github.com/bytedeco/javacpp-presets.git
-cd javacpp-presets
-mvn clean install --projects .,tritonserver
-mvn clean install -f platform --projects ../tritonserver/platform -Djavacpp.platform.host
-cd ..
 set -e
+git clone --single-branch --depth=1 -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git
+source client/src/java-api-bindings/scripts/install_dependencies_and_build.sh -b $PWD --javacpp-branch ${JAVACPP_BRANCH} --javacpp-tag ${JAVACPP_BRANCH_TAG} --keep-build-dependencies
+cd ..
 
 CLIENT_LOG="client.log"
 MODEL_REPO=`pwd`/models
diff --git a/qa/L0_java_simple_example/test.sh b/qa/L0_java_simple_example/test.sh
index b3a54d6a11..e9726edff4 100755
--- a/qa/L0_java_simple_example/test.sh
+++ b/qa/L0_java_simple_example/test.sh
@@ -37,14 +37,12 @@ if [ -z "$REPO_VERSION" ]; then
     exit 1
 fi
 
-set +e
-rm -r javacpp-presets
-git clone https://github.com/bytedeco/javacpp-presets.git
-cd javacpp-presets
-mvn clean install --projects .,tritonserver
-mvn clean install -f platform --projects ../tritonserver/platform -Djavacpp.platform.host
-cd ..
+JAVACPP_BRANCH=${JAVACPP_BRANCH:="https://github.com/bytedeco/javacpp-presets.git"}
+JAVACPP_BRANCH_TAG=${JAVACPP_BRANCH_TAG:="master"}
 set -e
+git clone --single-branch --depth=1 -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git
+source client/src/java-api-bindings/scripts/install_dependencies_and_build.sh -b $PWD --javacpp-branch ${JAVACPP_BRANCH} --javacpp-tag ${JAVACPP_BRANCH_TAG} --keep-build-dependencies
+cd ..
 
 CLIENT_LOG="client_cpu_only.log"
 DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository
diff --git a/qa/L0_json/test.sh b/qa/L0_json/test.sh
new file mode 100755
index 0000000000..522e17aa95
--- /dev/null
+++ b/qa/L0_json/test.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+RET=0
+UNIT_TEST="./triton_json_test"
+TEST_LOG="./triton_json_test.log"
+$UNIT_TEST >> $TEST_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $TEST_LOG
+    echo -e "\n***\n*** Triton Json Unit Test Failed\n***"
+    RET=1
+fi
+
+if [ $RET -eq 0 ]; then
+  echo -e "\n***\n*** Test Passed\n***"
+else
+  echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+exit $RET
diff --git a/qa/L0_large_payload/large_payload_test.py b/qa/L0_large_payload/large_payload_test.py
old mode 100644
new mode 100755
index 5ad0939a6f..fff57290ef
--- a/qa/L0_large_payload/large_payload_test.py
+++ b/qa/L0_large_payload/large_payload_test.py
@@ -1,4 +1,6 @@
-# Copyright 2019-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,19 +27,20 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
 import math
 import unittest
+
 import numpy as np
 import test_util as tu
 import tritongrpcclient as grpcclient
 import tritonhttpclient as httpclient
-from tritonclientutils import np_to_triton_dtype, InferenceServerException
+from tritonclientutils import InferenceServerException, np_to_triton_dtype
 
 
 class LargePayLoadTest(tu.TestResultCollector):
-
     def setUp(self):
         self._data_type = np.float32
 
@@ -45,36 +48,40 @@ def setUp(self):
         # hard limit on 2GBs for the size of input tensors. All backends except
         # plan backend should be able to handle payloads larger than 2GBs using
         # HTTP.
-        very_large_tensor_shape = (math.trunc(
-            3 * (1024 * 1024 * 1024) / np.dtype(self._data_type).itemsize),)
+        very_large_tensor_shape = (
+            math.trunc(3 * (1024 * 1024 * 1024) / np.dtype(self._data_type).itemsize),
+        )
         self._very_large_in0 = np.random.random(very_large_tensor_shape).astype(
-            self._data_type)
+            self._data_type
+        )
 
         # 1.9 GBs allows us to test gRPC with moderate sizes too.
-        large_tensor_shape = (math.trunc(1.9 * (1024 * 1024 * 1024) //
-                                         np.dtype(self._data_type).itemsize),)
-        self._large_in0 = np.random.random(large_tensor_shape).astype(
-            self._data_type)
+        large_tensor_shape = (
+            math.trunc(
+                1.9 * (1024 * 1024 * 1024) // np.dtype(self._data_type).itemsize
+            ),
+        )
+        self._large_in0 = np.random.random(large_tensor_shape).astype(self._data_type)
 
         small_tensor_shape = (1,)
-        self._small_in0 = np.random.random(small_tensor_shape).astype(
-            self._data_type)
-
-        self._clients = ((httpclient,
-                          httpclient.InferenceServerClient('localhost:8000')),
-                         (grpcclient,
-                          grpcclient.InferenceServerClient('localhost:8001')))
-
-    def _test_helper(self,
-                     client,
-                     model_name,
-                     input_name='INPUT0',
-                     output_name='OUTPUT0'):
-        # plan does not supoort large batch sizes.
-        if not model_name.startswith('plan'):
+        self._small_in0 = np.random.random(small_tensor_shape).astype(self._data_type)
+
+        self._clients = (
+            (httpclient, httpclient.InferenceServerClient("localhost:8000")),
+            (grpcclient, grpcclient.InferenceServerClient("localhost:8001")),
+        )
+
+    def _test_helper(
+        self, client, model_name, input_name="INPUT0", output_name="OUTPUT0"
+    ):
+        # plan does not support large batch sizes.
+        if not model_name.startswith("plan"):
             inputs = [
-                client[0].InferInput(input_name, self._large_in0.shape,
-                                     np_to_triton_dtype(self._data_type))
+                client[0].InferInput(
+                    input_name,
+                    self._large_in0.shape,
+                    np_to_triton_dtype(self._data_type),
+                )
             ]
             inputs[0].set_data_from_numpy(self._large_in0)
             results = client[1].infer(model_name, inputs)
@@ -83,13 +90,17 @@ def _test_helper(self,
             # the framework and protocol do support large payload
             self.assertTrue(
                 np.array_equal(self._large_in0, results.as_numpy(output_name)),
-                "output is different from input")
+                "output is different from input",
+            )
 
         if client[0] == httpclient:
             # FIXME HTTPServer cannot support large payloads. See DLIS-1776.
             inputs = [
-                client[0].InferInput(input_name, self._very_large_in0.shape,
-                                     np_to_triton_dtype(self._data_type))
+                client[0].InferInput(
+                    input_name,
+                    self._very_large_in0.shape,
+                    np_to_triton_dtype(self._data_type),
+                )
             ]
             inputs[0].set_data_from_numpy(self._very_large_in0)
             with self.assertRaises(InferenceServerException):
@@ -112,56 +123,54 @@ def _test_helper(self,
 
         # Send a small payload to verify if the server is still functional
         inputs = [
-            client[0].InferInput(input_name, self._small_in0.shape,
-                                 np_to_triton_dtype(self._data_type))
+            client[0].InferInput(
+                input_name, self._small_in0.shape, np_to_triton_dtype(self._data_type)
+            )
         ]
         inputs[0].set_data_from_numpy(self._small_in0)
         results = client[1].infer(model_name, inputs)
         self.assertTrue(
             np.array_equal(self._small_in0, results.as_numpy(output_name)),
-            "output is different from input")
+            "output is different from input",
+        )
 
     def test_graphdef(self):
         # graphdef_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("graphdef_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name("graphdef_nobatch", 1, self._data_type)
             self._test_helper(client, model_name)
 
     def test_savedmodel(self):
         # savedmodel_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("savedmodel_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name(
+                "savedmodel_nobatch", 1, self._data_type
+            )
             self._test_helper(client, model_name)
 
     def test_onnx(self):
         # onnx_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("onnx_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name("onnx_nobatch", 1, self._data_type)
             self._test_helper(client, model_name)
 
     def test_python(self):
         # python_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("python_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name("python_nobatch", 1, self._data_type)
             self._test_helper(client, model_name)
 
     def test_plan(self):
         # plan_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("plan_nobatch", 1,
-                                                self._data_type)
+            model_name = tu.get_zero_model_name("plan_nobatch", 1, self._data_type)
             self._test_helper(client, model_name)
 
     def test_libtorch(self):
         # libtorch_nobatch_zero_1_float32 is identity model with input shape [-1]
         for client in self._clients:
-            model_name = tu.get_zero_model_name("libtorch_nobatch", 1,
-                                                self._data_type)
-            self._test_helper(client, model_name, 'INPUT__0', 'OUTPUT__0')
+            model_name = tu.get_zero_model_name("libtorch_nobatch", 1, self._data_type)
+            self._test_helper(client, model_name, "INPUT__0", "OUTPUT__0")
 
     def test_custom(self):
         # custom_zero_1_float32 is identity model with input shape [-1]
@@ -170,5 +179,5 @@ def test_custom(self):
             self._test_helper(client, model_name)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
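
The reformatted setUp() above only reflows the size arithmetic, but the numbers are the point of the test: a float32 element is 4 bytes, so the "very large" tensor of about 3 GiB exceeds gRPC/protobuf's hard 2 GB message limit, while the 1.9 GiB tensor stays under it and should still work over gRPC. A minimal sketch of that arithmetic, using the same expressions as the test and nothing test-specific, is:

```python
# Minimal sketch of the payload-size arithmetic used in setUp() above.
import math
import numpy as np

itemsize = np.dtype(np.float32).itemsize           # 4 bytes per float32 element
very_large = math.trunc(3 * (1024**3) / itemsize)  # 805306368 elements, ~3 GiB
large = math.trunc(1.9 * (1024**3) // itemsize)    # 510027366 elements, ~1.9 GiB

print(very_large * itemsize)  # 3221225472 bytes: above the 2 GB gRPC/protobuf limit
print(large * itemsize)       # 2040109464 bytes: below the limit, so gRPC accepts it
```
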
diff --git a/qa/L0_large_payload/test.sh b/qa/L0_large_payload/test.sh
old mode 100644
new mode 100755
diff --git a/qa/L0_libtorch_inference_mode/test.sh b/qa/L0_libtorch_inference_mode/test.sh
old mode 100644
new mode 100755
index 5017f12769..85b4a49fae
--- a/qa/L0_libtorch_inference_mode/test.sh
+++ b/qa/L0_libtorch_inference_mode/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright (c) 2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -73,7 +73,7 @@ for FLAG in true false; do
 
     set +e
 
-    python $SIMPLE_INFER_CLIENT_PY >> $CLIENT_LOG 2>&1
+    python $LIBTORCH_INFER_CLIENT_PY >> $CLIENT_LOG 2>&1
     if [ $? -ne 0 ]; then
         RET=1
     fi
diff --git a/qa/L0_libtorch_instance_group_kind_model/client.py b/qa/L0_libtorch_instance_group_kind_model/client.py
new file mode 100755
index 0000000000..92bead3464
--- /dev/null
+++ b/qa/L0_libtorch_instance_group_kind_model/client.py
@@ -0,0 +1,90 @@
+#!/usr/bin/env python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import os
+import sys
+
+sys.path.append("../common")
+
+import unittest
+
+import numpy as np
+import test_util as tu
+import tritonclient.http as httpclient
+
+# By default, find tritonserver on "localhost", but can be overridden
+# with TRITONSERVER_IPADDR envvar
+_tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost")
+
+
+class InferTest(tu.TestResultCollector):
+    def test_infer(self):
+        try:
+            triton_client = httpclient.InferenceServerClient(
+                url=f"{_tritonserver_ipaddr}:8000"
+            )
+        except Exception as e:
+            print("channel creation failed: " + str(e))
+            sys.exit(1)
+
+        model_name = os.environ["MODEL_NAME"]
+
+        inputs = []
+        outputs = []
+        inputs.append(httpclient.InferInput("INPUT0", [1, 16], "FP32"))
+        inputs.append(httpclient.InferInput("INPUT1", [1, 16], "FP32"))
+
+        # Create the data for the two input tensors.
+        input0_data = np.arange(start=0, stop=16, dtype=np.float32)
+        input0_data = np.expand_dims(input0_data, axis=0)
+        input1_data = np.arange(start=32, stop=48, dtype=np.float32)
+        input1_data = np.expand_dims(input1_data, axis=0)
+
+        # Initialize the data
+        inputs[0].set_data_from_numpy(input0_data, binary_data=True)
+        inputs[1].set_data_from_numpy(input1_data, binary_data=True)
+
+        outputs.append(httpclient.InferRequestedOutput("OUTPUT__0", binary_data=True))
+        outputs.append(httpclient.InferRequestedOutput("OUTPUT__1", binary_data=True))
+
+        results = triton_client.infer(model_name, inputs, outputs=outputs)
+
+        output0_data = results.as_numpy("OUTPUT__0")
+        output1_data = results.as_numpy("OUTPUT__1")
+
+        expected_output_0 = input0_data + input1_data
+        expected_output_1 = input0_data - input1_data
+
+        self.assertEqual(output0_data.shape, (1, 16))
+        self.assertEqual(output1_data.shape, (1, 16))
+
+        self.assertTrue(np.all(expected_output_0 == output0_data))
+        self.assertTrue(np.all(expected_output_1 == output1_data))
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_libtorch_instance_group_kind_model/gen_models.py b/qa/L0_libtorch_instance_group_kind_model/gen_models.py
new file mode 100755
index 0000000000..e61980f491
--- /dev/null
+++ b/qa/L0_libtorch_instance_group_kind_model/gen_models.py
@@ -0,0 +1,90 @@
+#!/usr/bin/python
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import torch
+import torch.nn as nn
+
+
+class SumModule(nn.Module):
+    def __init__(self, device):
+        super(SumModule, self).__init__()
+        self.device = device
+
+    def forward(self, INPUT0, INPUT1):
+        INPUT0 = INPUT0.to(self.device)
+        INPUT1 = INPUT1.to(self.device)
+        print(
+            "SumModule - INPUT0 device: {}, INPUT1 device: {}\n".format(
+                INPUT0.device, INPUT1.device
+            )
+        )
+        return INPUT0 + INPUT1
+
+
+class DiffModule(nn.Module):
+    def __init__(self, device):
+        super(DiffModule, self).__init__()
+        self.device = device
+
+    def forward(self, INPUT0, INPUT1):
+        INPUT0 = INPUT0.to(self.device)
+        INPUT1 = INPUT1.to(self.device)
+        print(
+            "DiffModule - INPUT0 device: {}, INPUT1 device: {}\n".format(
+                INPUT0.device, INPUT1.device
+            )
+        )
+        return INPUT0 - INPUT1
+
+
+class TestModel(nn.Module):
+    def __init__(self, device0, device1):
+        super(TestModel, self).__init__()
+        self.device0 = device0
+        self.device1 = device1
+
+        self.layer1 = SumModule(self.device0)
+        self.layer2 = DiffModule(self.device1)
+
+    def forward(self, INPUT0, INPUT1):
+        op0 = self.layer1(INPUT0, INPUT1)
+        op1 = self.layer2(INPUT0, INPUT1)
+        return op0, op1
+
+
+if torch.cuda.device_count() < 4:
+    print("Need at least 4 GPUs to run this test")
+    exit(1)
+
+devices = [("cuda:2", "cuda:0"), ("cpu", "cuda:3")]
+model_names = ["libtorch_multi_gpu", "libtorch_multi_device"]
+
+for device_pair, model_name in zip(devices, model_names):
+    model = TestModel(device_pair[0], device_pair[1])
+    model_path = "models/" + model_name + "/1/model.pt"
+    scripted_model = torch.jit.script(model)
+    scripted_model.save(model_path)
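
gen_models.py scripts two copies of TestModel: libtorch_multi_gpu places SumModule on cuda:2 and DiffModule on cuda:0, while libtorch_multi_device splits them across cpu and cuda:3; the device prints inside forward() are the lines the accompanying test.sh greps for in the server log. As a rough local smoke test (an assumption-laden sketch, not part of the test: it presumes gen_models.py has already written model.pt and that at least 4 GPUs are visible so cuda:3 exists), one could load the scripted artifact directly:

```python
# Hypothetical local check after running gen_models.py; assumes models/ exists
# and cuda:3 is a valid device on this machine.
import torch

model = torch.jit.load("models/libtorch_multi_device/1/model.pt")
in0 = torch.arange(0, 16, dtype=torch.float32).unsqueeze(0)
in1 = torch.arange(32, 48, dtype=torch.float32).unsqueeze(0)

# forward() prints the "SumModule - ..." / "DiffModule - ..." device lines
# that the test script later checks for in the Triton server log.
out0, out1 = model(in0, in1)
print(out0.device, out1.device)  # expected: cpu and cuda:3 for libtorch_multi_device
```
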
diff --git a/qa/L0_libtorch_instance_group_kind_model/models/libtorch_multi_device/config.pbtxt b/qa/L0_libtorch_instance_group_kind_model/models/libtorch_multi_device/config.pbtxt
new file mode 100644
index 0000000000..bf8ca0d649
--- /dev/null
+++ b/qa/L0_libtorch_instance_group_kind_model/models/libtorch_multi_device/config.pbtxt
@@ -0,0 +1,60 @@
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+name: "libtorch_multi_device"
+platform: "pytorch_libtorch"
+max_batch_size: 8
+
+input [
+  {
+    name: "INPUT0"
+    data_type: TYPE_FP32
+    dims: [ 16 ]
+  },
+  {
+    name: "INPUT1"
+    data_type: TYPE_FP32
+    dims: [ 16 ]
+  }
+]
+output [
+  {
+    name: "OUTPUT__0"
+    data_type: TYPE_FP32
+    dims: [ 4 ]
+  },
+  {
+    name: "OUTPUT__1"
+    data_type: TYPE_FP32
+    dims: [ 4 ]
+  }
+]
+
+instance_group [
+  {
+    kind: KIND_MODEL
+  }
+]
diff --git a/qa/L0_libtorch_instance_group_kind_model/test.sh b/qa/L0_libtorch_instance_group_kind_model/test.sh
new file mode 100755
index 0000000000..04d76bd036
--- /dev/null
+++ b/qa/L0_libtorch_instance_group_kind_model/test.sh
@@ -0,0 +1,149 @@
+#!/bin/bash
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+pip3 uninstall -y torch
+pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html
+
+DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository
+SERVER=/opt/tritonserver/bin/tritonserver
+SERVER_ARGS="--model-repository=models --log-verbose=1"
+SERVER_LOG="./inference_server.log"
+
+CLIENT_PY=./client.py
+CLIENT_LOG="./client.log"
+EXPECTED_NUM_TESTS="1"
+TEST_RESULT_FILE='test_results.txt'
+
+source ../common/util.sh
+
+RET=0
+
+rm -f *.log *.txt
+
+mkdir -p models/libtorch_multi_device/1
+mkdir -p models/libtorch_multi_gpu/1
+cp models/libtorch_multi_device/config.pbtxt models/libtorch_multi_gpu/.
+(cd models/libtorch_multi_gpu && \
+    sed -i "s/name: \"libtorch_multi_device\"/name: \"libtorch_multi_gpu\"/" config.pbtxt)
+
+# Generate the models which are partitioned across multiple devices
+set +e
+python3 gen_models.py >> $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Error when generating models. \n***"
+    cat $CLIENT_LOG
+    exit 1
+fi
+set -e
+
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+export MODEL_NAME='libtorch_multi_device'
+python3 $CLIENT_PY >> $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Model $MODEL_NAME FAILED. \n***"
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+
+MESSAGES=("SumModule - INPUT0 device: cpu, INPUT1 device: cpu"
+    "DiffModule - INPUT0 device: cuda:3, INPUT1 device: cuda:3")
+for MESSAGE in "${MESSAGES[@]}"; do
+    if grep -q "$MESSAGE" "$SERVER_LOG"; then
+        echo -e "Found \"$MESSAGE\"" >> "$CLIENT_LOG"
+    else
+        echo -e "Not found \"$MESSAGE\"" >> "$CLIENT_LOG"
+        RET=1
+    fi
+done
+
+export MODEL_NAME='libtorch_multi_gpu'
+python3 $CLIENT_PY >> $CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    echo -e "\n***\n*** Model $MODEL_NAME FAILED. \n***"
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+
+MESSAGES=("SumModule - INPUT0 device: cuda:2, INPUT1 device: cuda:2"
+    "DiffModule - INPUT0 device: cuda:0, INPUT1 device: cuda:0")
+for MESSAGE in "${MESSAGES[@]}"; do
+    if grep -q "$MESSAGE" "$SERVER_LOG"; then
+        echo -e "Found \"$MESSAGE\"" >> "$CLIENT_LOG"
+    else
+        echo -e "Not found \"$MESSAGE\"" >> "$CLIENT_LOG"
+        RET=1
+    fi
+done
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+exit $RET
diff --git a/qa/L0_libtorch_io_names/io_names_client.py b/qa/L0_libtorch_io_names/io_names_client.py
old mode 100644
new mode 100755
index 54cf972778..b74e520de2
--- a/qa/L0_libtorch_io_names/io_names_client.py
+++ b/qa/L0_libtorch_io_names/io_names_client.py
@@ -1,5 +1,5 @@
 #!/usr/bin/python
-# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -26,24 +26,22 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-from builtins import range
-from future.utils import iteritems
 import unittest
-import test_util as tu
-import numpy as np
+from builtins import range
 
+import numpy as np
+import test_util as tu
 import tritonclient.http as httpclient
-from tritonclient.utils import np_to_triton_dtype
-from tritonclient.utils import InferenceServerException
 
 
 class IONamingConvention(tu.TestResultCollector):
-
     def _infer_helper(self, model_name, io_names, reversed_order=False):
-        triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                         verbose=False)
+        triton_client = httpclient.InferenceServerClient(
+            "localhost:8000", verbose=False
+        )
 
         # Create the data for the two inputs. Initialize the first to unique
         # integers and the second to all ones.
@@ -55,30 +53,34 @@ def _infer_helper(self, model_name, io_names, reversed_order=False):
         output_req = []
         inputs.append(
             httpclient.InferInput(
-                io_names[0] if not reversed_order else io_names[1], [1, 16],
-                "FP32"))
+                io_names[0] if not reversed_order else io_names[1], [1, 16], "FP32"
+            )
+        )
         inputs[-1].set_data_from_numpy(input0_data)
         inputs.append(
             httpclient.InferInput(
-                io_names[1] if not reversed_order else io_names[0], [1, 16],
-                "FP32"))
+                io_names[1] if not reversed_order else io_names[0], [1, 16], "FP32"
+            )
+        )
         inputs[-1].set_data_from_numpy(input1_data)
         output_req.append(
-            httpclient.InferRequestedOutput(io_names[2], binary_data=True))
+            httpclient.InferRequestedOutput(io_names[2], binary_data=True)
+        )
         output_req.append(
-            httpclient.InferRequestedOutput(io_names[3], binary_data=True))
+            httpclient.InferRequestedOutput(io_names[3], binary_data=True)
+        )
 
         results = triton_client.infer(model_name, inputs, outputs=output_req)
 
         output0_data = results.as_numpy(
-            io_names[2] if not reversed_order else io_names[3])
+            io_names[2] if not reversed_order else io_names[3]
+        )
         output1_data = results.as_numpy(
-            io_names[3] if not reversed_order else io_names[2])
+            io_names[3] if not reversed_order else io_names[2]
+        )
         for i in range(16):
-            self.assertEqual(input0_data[0][i] - input1_data[0][i],
-                             output0_data[0][i])
-            self.assertEqual(input0_data[0][i] + input1_data[0][i],
-                             output1_data[0][i])
+            self.assertEqual(input0_data[0][i] - input1_data[0][i], output0_data[0][i])
+            self.assertEqual(input0_data[0][i] + input1_data[0][i], output1_data[0][i])
 
     def test_io_index(self):
         io_names = ["INPUT__0", "INPUT__1", "OUTPUT__0", "OUTPUT__1"]
@@ -110,10 +112,8 @@ def test_mix_arguments_index(self):
 
     def test_unordered_index(self):
         io_names = ["INPUT1", "INPUT0", "OUT__1", "OUT__0"]
-        self._infer_helper("libtorch_unordered_index",
-                           io_names,
-                           reversed_order=True)
+        self._infer_helper("libtorch_unordered_index", io_names, reversed_order=True)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_libtorch_io_names/test.sh b/qa/L0_libtorch_io_names/test.sh
old mode 100644
new mode 100755
diff --git a/qa/L0_libtorch_io_types/test.sh b/qa/L0_libtorch_io_types/test.sh
new file mode 100755
index 0000000000..ddd38810b6
--- /dev/null
+++ b/qa/L0_libtorch_io_types/test.sh
@@ -0,0 +1,131 @@
+#!/bin/bash
+# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+export CUDA_VISIBLE_DEVICES=0
+
+SERVER=/opt/tritonserver/bin/tritonserver
+SERVER_ARGS="--model-repository=models"
+SERVER_LOG="./server.log"
+DATADIR=/data/inferenceserver/${REPO_VERSION}
+source ../common/util.sh
+
+# Test unsupported INPUT data type
+rm -rf models && mkdir -p models
+cp -r $DATADIR/qa_model_repository/libtorch_int32_int8_int8 models/libtorch_invalid_input_type && \
+    sed -i 's/libtorch_int32_int8_int8/libtorch_invalid_input_type/' models/libtorch_invalid_input_type/config.pbtxt && \
+    sed -i 's/TYPE_INT32/TYPE_UINT32/' models/libtorch_invalid_input_type/config.pbtxt
+
+rm -f *.log
+
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unexpected server start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    exit 1
+fi
+
+set +e
+grep "unsupported datatype TYPE_UINT32 for input 'INPUT0' for model 'libtorch_invalid_input_type'" $SERVER_LOG
+if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unsupported INPUT datatype not found in server log\n***"
+    exit 1
+fi
+set -e
+
+# Test unsupported OUTPUT data type
+rm -rf models && mkdir -p models
+cp -r $DATADIR/qa_model_repository/libtorch_int32_int8_int8 models/libtorch_invalid_output_type && \
+    sed -i 's/libtorch_int32_int8_int8/libtorch_invalid_output_type/' models/libtorch_invalid_output_type/config.pbtxt && \
+    sed -i 's/TYPE_INT8/TYPE_UINT64/' models/libtorch_invalid_output_type/config.pbtxt
+
+rm -f *.log
+
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unexpected server start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    exit 1
+fi
+
+set +e
+grep "unsupported datatype TYPE_UINT64 for output 'OUTPUT__0' for model 'libtorch_invalid_output_type'" $SERVER_LOG
+if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unsupported OUTPUT datatype not found in server log\n***"
+    exit 1
+fi
+set -e
+
+# Test unsupported sequence_batching data type
+rm -rf models && mkdir -p models
+cp -r $DATADIR/qa_variable_sequence_model_repository/libtorch_sequence_int32 models/libtorch_invalid_sequence_int32 && \
+    sed -i 's/libtorch_sequence_int32/libtorch_invalid_sequence_int32/' models/libtorch_invalid_sequence_int32/config.pbtxt && \
+    sed -i 's/READY__2/CORRID__2/' models/libtorch_invalid_sequence_int32/config.pbtxt && \
+    sed -i 's/CONTROL_SEQUENCE_READY/CONTROL_SEQUENCE_CORRID/' models/libtorch_invalid_sequence_int32/config.pbtxt && \
+    sed -i ':begin;$!N;s/CORRID\n\(.*\)int32_false_true: \[ 0, 1 \]/CORRID\ndata_type: TYPE_UINT32/' models/libtorch_invalid_sequence_int32/config.pbtxt
+
+rm -f *.log
+
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unexpected server start $SERVER\n***"
+    kill $SERVER_PID
+    wait $SERVER_PID
+    exit 1
+fi
+
+set +e
+grep "input 'CORRID__2' type 'TYPE_UINT32' is not supported by PyTorch." $SERVER_LOG
+if [ $? -ne 0 ]; then
+    cat $SERVER_LOG
+    echo -e "\n***\n*** Unsupported sequence_batching datatype not found in server log\n***"
+    exit 1
+fi
+set -e
+
+# Test passed
+echo -e "\n***\n*** Test Passed\n***"
+exit 0
diff --git a/qa/L0_libtorch_nvfuser/test.sh b/qa/L0_libtorch_nvfuser/test.sh
old mode 100644
new mode 100755
index b4a31e9984..4614a66de1
--- a/qa/L0_libtorch_nvfuser/test.sh
+++ b/qa/L0_libtorch_nvfuser/test.sh
@@ -91,8 +91,7 @@ parameters: {
 
     NVFUSER_LOG="NvFuser is "
     if [ "$FLAG" == "true" ]; then
-        # NvFuser support has been disabled. Change to 'enabled' when fixed.
-        NVFUSER_LOG+="disabled"
+        NVFUSER_LOG+="enabled"
     elif [ "$FLAG" == "false" ]; then
         NVFUSER_LOG+="disabled"
     else
diff --git a/qa/L0_libtorch_optimized_execution/test.sh b/qa/L0_libtorch_optimized_execution/test.sh
old mode 100644
new mode 100755
diff --git a/qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py b/qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py
old mode 100644
new mode 100755
index 3f08b63962..7c2fdb5a71
--- a/qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py
+++ b/qa/L0_libtorch_shared_weights/libtorch_shared_weights_test.py
@@ -1,4 +1,6 @@
-# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -28,33 +30,29 @@
 
 sys.path.append("../common")
 
-import argparse
-import numpy as np
-import requests as httpreq
 import unittest
 from builtins import range
-import tritonhttpclient as httpclient
+
+import numpy as np
 import test_util as tu
+import tritonhttpclient as httpclient
 
 FLAGS = None
 
 
 class SharedWeightsTest(tu.TestResultCollector):
-
     def _full_exact(self, model_name, request_concurrency, shape):
-
         # Run async requests to make sure backend handles concurrent requests
         # correctly.
         client = httpclient.InferenceServerClient(
-            "localhost:8000", concurrency=request_concurrency)
+            "localhost:8000", concurrency=request_concurrency
+        )
         input_datas = []
         requests = []
         for i in range(request_concurrency):
             input_data = (16384 * np.random.randn(*shape)).astype(np.float32)
             input_datas.append(input_data)
-            inputs = [
-                httpclient.InferInput("INPUT__0", input_data.shape, "FP32")
-            ]
+            inputs = [httpclient.InferInput("INPUT__0", input_data.shape, "FP32")]
             inputs[0].set_data_from_numpy(input_data)
             requests.append(client.async_infer(model_name, inputs))
 
@@ -64,8 +62,7 @@ def _full_exact(self, model_name, request_concurrency, shape):
             results = requests[i].get_result()
 
             output_data = results.as_numpy("OUTPUT__0")
-            self.assertIsNotNone(output_data,
-                                 "error: expected 'OUTPUT__0' to be found")
+            self.assertIsNotNone(output_data, "error: expected 'OUTPUT__0' to be found")
             np.testing.assert_allclose(output_data, input_datas[i])
 
     def test_pytorch_identity_model(self):
@@ -73,5 +70,5 @@ def test_pytorch_identity_model(self):
         self._full_exact(model_name, 128, [8])
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_libtorch_shared_weights/test.sh b/qa/L0_libtorch_shared_weights/test.sh
old mode 100644
new mode 100755
index e6f23b7a45..6ca251ce32
--- a/qa/L0_libtorch_shared_weights/test.sh
+++ b/qa/L0_libtorch_shared_weights/test.sh
@@ -1,4 +1,5 @@
-# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/bin/bash
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
diff --git a/qa/L0_lifecycle/lifecycle_test.py b/qa/L0_lifecycle/lifecycle_test.py
old mode 100644
new mode 100755
index aaf5b033dc..ea2eecb20a
--- a/qa/L0_lifecycle/lifecycle_test.py
+++ b/qa/L0_lifecycle/lifecycle_test.py
@@ -1,4 +1,6 @@
-# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -28,92 +30,101 @@
 
 sys.path.append("../common")
 
-from builtins import range
-from functools import partial
+import base64
+import concurrent.futures
+import json
 import os
 import shutil
 import signal
+import threading
 import time
 import unittest
-import numpy as np
+from builtins import range
+from functools import partial
+
 import infer_util as iu
+import numpy as np
 import test_util as tu
-import threading
-
 import tritonclient.grpc as grpcclient
 import tritonclient.http as httpclient
 from tritonclient.utils import InferenceServerException
 
 
 class LifeCycleTest(tu.TestResultCollector):
-
-    def _infer_success_models(self,
-                              model_base_names,
-                              versions,
-                              tensor_shape,
-                              swap=False):
+    def _infer_success_models(
+        self, model_base_names, versions, tensor_shape, swap=False
+    ):
         for base_name in model_base_names:
             try:
-                model_name = tu.get_model_name(base_name, np.float32,
-                                               np.float32, np.float32)
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                model_name = tu.get_model_name(
+                    base_name, np.float32, np.float32, np.float32
+                )
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     # FIXME is_server_ready should be true here DLIS-1296
                     # self.assertTrue(triton_client.is_server_ready())
                     for v in versions:
                         self.assertTrue(
-                            triton_client.is_model_ready(model_name, str(v)))
+                            triton_client.is_model_ready(model_name, str(v))
+                        )
 
                 for v in versions:
-                    iu.infer_exact(self,
-                                   base_name,
-                                   tensor_shape,
-                                   1,
-                                   np.float32,
-                                   np.float32,
-                                   np.float32,
-                                   model_version=v,
-                                   swap=(swap or (v != 1)))
+                    iu.infer_exact(
+                        self,
+                        base_name,
+                        tensor_shape,
+                        1,
+                        np.float32,
+                        np.float32,
+                        np.float32,
+                        model_version=v,
+                        swap=(swap or (v != 1)),
+                    )
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
-    def _infer_success_identity(self, model_base, versions, tensor_dtype,
-                                tensor_shape):
+    def _infer_success_identity(self, model_base, versions, tensor_dtype, tensor_shape):
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             for v in versions:
                 self.assertTrue(
                     triton_client.is_model_ready(
-                        tu.get_zero_model_name(model_base, 1, tensor_dtype),
-                        str(v)))
+                        tu.get_zero_model_name(model_base, 1, tensor_dtype), str(v)
+                    )
+                )
 
             for v in versions:
-                iu.infer_zero(self,
-                              model_base,
-                              1,
-                              tensor_dtype,
-                              tensor_shape,
-                              tensor_shape,
-                              use_http=False,
-                              use_grpc=True,
-                              use_http_json_tensors=False,
-                              use_streaming=False)
+                iu.infer_zero(
+                    self,
+                    model_base,
+                    1,
+                    tensor_dtype,
+                    tensor_shape,
+                    tensor_shape,
+                    use_http=False,
+                    use_grpc=True,
+                    use_http_json_tensors=False,
+                    use_streaming=False,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def _get_client(self, use_grpc=False):
         if use_grpc:
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
         else:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
         return triton_client
 
     def _async_load(self, model_name, use_grpc):
@@ -129,8 +140,9 @@ def test_parse_error_noexit(self):
         # SERVER_FAILED_TO_INITIALIZE status.
         # Server is not live and not ready regardless of --strict-readiness
         try:
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             self.assertFalse(triton_client.is_server_live())
             self.assertFalse(triton_client.is_server_ready())
             md = triton_client.get_server_metadata()
@@ -140,13 +152,14 @@ def test_parse_error_noexit(self):
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertFalse(triton_client.is_server_live())
             self.assertFalse(triton_client.is_server_ready())
             md = triton_client.get_server_metadata()
-            self.assertEqual(os.environ["TRITON_SERVER_VERSION"], md['version'])
-            self.assertEqual("triton", md['name'])
+            self.assertEqual(os.environ["TRITON_SERVER_VERSION"], md["version"])
+            self.assertEqual("triton", md["name"])
         except InferenceServerException as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -156,17 +169,20 @@ def test_parse_error_modelfail(self):
 
         # Server was started but with a model that fails to load
         try:
-            model_name = tu.get_model_name('graphdef', np.float32, np.float32,
-                                           np.float32)
+            model_name = tu.get_model_name(
+                "graphdef", np.float32, np.float32, np.float32
+            )
 
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertFalse(triton_client.is_server_ready())
             self.assertFalse(triton_client.is_model_ready(model_name, "1"))
 
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertFalse(triton_client.is_server_ready())
             self.assertFalse(triton_client.is_model_ready(model_name, "1"))
@@ -175,35 +191,38 @@ def test_parse_error_modelfail(self):
 
         # Inferencing with the missing model should fail.
         try:
-            iu.infer_exact(self, 'graphdef', tensor_shape, 1, np.float32,
-                           np.float32, np.float32)
-            self.assertTrue(
-                False, "expected error for unavailable model " + model_name)
+            iu.infer_exact(
+                self, "graphdef", tensor_shape, 1, np.float32, np.float32, np.float32
+            )
+            self.assertTrue(False, "expected error for unavailable model " + model_name)
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'graphdef_float32_float32_float32' has no available versions",
-                ex.message())
+                ex.message(),
+            )
 
         # And other models should be loaded successfully
         try:
-            for base_name in ['savedmodel', 'onnx']:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
-                    model_name = tu.get_model_name(base_name, np.float32,
-                                                   np.float32, np.float32)
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "1"))
-
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               model_version=1)
+            for base_name in ["savedmodel", "onnx"]:
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
+                    model_name = tu.get_model_name(
+                        base_name, np.float32, np.float32, np.float32
+                    )
+                    self.assertTrue(triton_client.is_model_ready(model_name, "1"))
+
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    model_version=1,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -213,17 +232,20 @@ def test_parse_error_modelfail_nostrict(self):
 
         # Server was started but with a model that fails to load
         try:
-            model_name = tu.get_model_name('graphdef', np.float32, np.float32,
-                                           np.float32)
+            model_name = tu.get_model_name(
+                "graphdef", np.float32, np.float32, np.float32
+            )
 
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             self.assertFalse(triton_client.is_model_ready(model_name, "1"))
 
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             self.assertFalse(triton_client.is_model_ready(model_name, "1"))
@@ -232,35 +254,38 @@ def test_parse_error_modelfail_nostrict(self):
 
         # Inferencing with the missing model should fail.
         try:
-            iu.infer_exact(self, 'graphdef', tensor_shape, 1, np.float32,
-                           np.float32, np.float32)
-            self.assertTrue(
-                False, "expected error for unavailable model " + model_name)
+            iu.infer_exact(
+                self, "graphdef", tensor_shape, 1, np.float32, np.float32, np.float32
+            )
+            self.assertTrue(False, "expected error for unavailable model " + model_name)
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'graphdef_float32_float32_float32' has no available versions",
-                ex.message())
+                ex.message(),
+            )
 
         # And other models should be loaded successfully
         try:
-            for base_name in ['savedmodel', 'onnx']:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
-                    model_name = tu.get_model_name(base_name, np.float32,
-                                                   np.float32, np.float32)
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "1"))
-
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               model_version=1)
+            for base_name in ["savedmodel", "onnx"]:
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
+                    model_name = tu.get_model_name(
+                        base_name, np.float32, np.float32, np.float32
+                    )
+                    self.assertTrue(triton_client.is_model_ready(model_name, "1"))
+
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    model_version=1,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -268,13 +293,14 @@ def test_parse_error_no_model_config(self):
         tensor_shape = (1, 16)
 
         # Server was started but with a model that fails to be polled
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
-                model_name = tu.get_model_name('graphdef', np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    "graphdef", np.float32, np.float32, np.float32
+                )
 
                 # expecting ready because not strict readiness
                 self.assertTrue(triton_client.is_server_live())
@@ -282,29 +308,36 @@ def test_parse_error_no_model_config(self):
 
                 md = triton_client.get_model_metadata(model_name, "1")
                 self.assertTrue(
-                    False, "expected model '" + model_name +
-                    "' to be ignored due to polling failure")
+                    False,
+                    "expected model '"
+                    + model_name
+                    + "' to be ignored due to polling failure",
+                )
 
             except Exception as ex:
                 self.assertIn(
                     "Request for unknown model: 'graphdef_float32_float32_float32' is not found",
-                    ex.message())
+                    ex.message(),
+                )
 
         # And other models should be loaded successfully
         try:
-            for base_name in ['savedmodel', 'onnx']:
-                model_name = tu.get_model_name(base_name, np.float32,
-                                               np.float32, np.float32)
+            for base_name in ["savedmodel", "onnx"]:
+                model_name = tu.get_model_name(
+                    base_name, np.float32, np.float32, np.float32
+                )
                 self.assertTrue(triton_client.is_model_ready(model_name, "1"))
 
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               model_version=1)
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    model_version=1,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -312,10 +345,10 @@ def test_init_error_modelfail(self):
         # --strict-readiness=true so server is live but not ready
 
         # Server was started but with models that fail to load
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertFalse(triton_client.is_server_ready())
@@ -330,24 +363,27 @@ def test_init_error_modelfail(self):
 
             # And other models should be loaded successfully
             try:
-                for base_name in ['graphdef', 'savedmodel', 'onnx']:
-                    model_name = tu.get_model_name(base_name, np.float32,
-                                                   np.float32, np.float32)
+                for base_name in ["graphdef", "savedmodel", "onnx"]:
+                    model_name = tu.get_model_name(
+                        base_name, np.float32, np.float32, np.float32
+                    )
                     self.assertTrue(triton_client.is_model_ready(model_name))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
             tensor_shape = (1, 16)
-            for base_name in ['graphdef', 'savedmodel', 'onnx']:
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               model_version=1)
+            for base_name in ["graphdef", "savedmodel", "onnx"]:
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    model_version=1,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -356,95 +392,105 @@ def test_parse_error_model_no_version(self):
         tensor_shape = (1, 16)
 
         # Server was started but with a model that fails to load
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertFalse(triton_client.is_server_ready())
 
-                model_name = tu.get_model_name('graphdef', np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    "graphdef", np.float32, np.float32, np.float32
+                )
                 self.assertFalse(triton_client.is_model_ready(model_name))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
             # Sanity check that other models are loaded properly
             try:
-                for base_name in ['savedmodel', 'onnx']:
-                    model_name = tu.get_model_name(base_name, np.float32,
-                                                   np.float32, np.float32)
+                for base_name in ["savedmodel", "onnx"]:
+                    model_name = tu.get_model_name(
+                        base_name, np.float32, np.float32, np.float32
+                    )
                     self.assertTrue(triton_client.is_model_ready(model_name))
                 for version in ["1", "3"]:
-                    model_name = tu.get_model_name("plan", np.float32,
-                                                   np.float32, np.float32)
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, version))
+                    model_name = tu.get_model_name(
+                        "plan", np.float32, np.float32, np.float32
+                    )
+                    self.assertTrue(triton_client.is_model_ready(model_name, version))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
-            for base_name in ['savedmodel', 'onnx']:
-                iu.infer_exact(self,
-                               base_name,
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=True)
+            for base_name in ["savedmodel", "onnx"]:
+                iu.infer_exact(
+                    self,
+                    base_name,
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=True,
+                )
             for version in [1, 3]:
-                iu.infer_exact(self,
-                               'plan',
-                               tensor_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=(version == 3),
-                               model_version=version)
+                iu.infer_exact(
+                    self,
+                    "plan",
+                    tensor_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=(version == 3),
+                    model_version=version,
+                )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
-            iu.infer_exact(self, 'graphdef', tensor_shape, 1, np.float32,
-                           np.float32, np.float32)
-            self.assertTrue(
-                False, "expected error for unavailable model " + model_name)
+            iu.infer_exact(
+                self, "graphdef", tensor_shape, 1, np.float32, np.float32, np.float32
+            )
+            self.assertTrue(False, "expected error for unavailable model " + model_name)
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'graphdef_float32_float32_float32' has no available versions",
-                ex.message())
+                ex.message(),
+            )
 
     def test_parse_ignore_zero_prefixed_version(self):
         tensor_shape = (1, 16)
 
         # Server was started but only version 1 is loaded
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
 
-                model_name = tu.get_model_name('savedmodel', np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    "savedmodel", np.float32, np.float32, np.float32
+                )
                 self.assertTrue(triton_client.is_model_ready(model_name, "1"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
             # swap=False for version 1
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=False)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=False,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -452,53 +498,54 @@ def test_parse_ignore_non_intergral_version(self):
         tensor_shape = (1, 16)
 
         # Server was started but only version 1 is loaded
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
 
-                model_name = tu.get_model_name('savedmodel', np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    "savedmodel", np.float32, np.float32, np.float32
+                )
                 self.assertTrue(triton_client.is_model_ready(model_name, "1"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         try:
             # swap=False for version 1
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=False)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=False,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_dynamic_model_load_unload(self):
         tensor_shape = (1, 16)
-        savedmodel_name = tu.get_model_name('savedmodel', np.float32,
-                                            np.float32, np.float32)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
+        savedmodel_name = tu.get_model_name(
+            "savedmodel", np.float32, np.float32, np.float32
+        )
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         # Make sure savedmodel model is not in the status (because
         # initially it is not in the model repository)
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
             except Exception as ex:
@@ -509,16 +556,14 @@ def test_dynamic_model_load_unload(self):
         try:
             shutil.copytree(savedmodel_name, "models/" + savedmodel_name)
             time.sleep(5)  # wait for model to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -526,47 +571,58 @@ def test_dynamic_model_load_unload(self):
 
         # Run inference on the just loaded model
         try:
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Make sure savedmodel has execution stats
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             stats = triton_client.get_inference_statistics(savedmodel_name)
             self.assertEqual(len(stats["model_stats"]), 2)
             for idx in range(len(stats["model_stats"])):
-                self.assertEqual(stats["model_stats"][idx]["name"],
-                                 savedmodel_name)
+                self.assertEqual(stats["model_stats"][idx]["name"], savedmodel_name)
                 if stats["model_stats"][idx]["version"] == "1":
                     self.assertEqual(
-                        stats["model_stats"][idx]["inference_stats"]["success"]
-                        ["count"], 0)
+                        stats["model_stats"][idx]["inference_stats"]["success"][
+                            "count"
+                        ],
+                        0,
+                    )
                 else:
                     self.assertNotEqual(
-                        stats["model_stats"][idx]["inference_stats"]["success"]
-                        ["count"], 0)
-
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+                        stats["model_stats"][idx]["inference_stats"]["success"][
+                            "count"
+                        ],
+                        0,
+                    )
+
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             stats = triton_client.get_inference_statistics(savedmodel_name)
             self.assertEqual(len(stats.model_stats), 2)
             for idx in range(len(stats.model_stats)):
                 self.assertEqual(stats.model_stats[idx].name, savedmodel_name)
                 if stats.model_stats[idx].version == "1":
                     self.assertEqual(
-                        stats.model_stats[idx].inference_stats.success.count, 0)
+                        stats.model_stats[idx].inference_stats.success.count, 0
+                    )
                 else:
                     self.assertNotEqual(
-                        stats.model_stats[idx].inference_stats.success.count, 0)
+                        stats.model_stats[idx].inference_stats.success.count, 0
+                    )
 
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
@@ -576,16 +632,14 @@ def test_dynamic_model_load_unload(self):
         try:
             shutil.rmtree("models/" + savedmodel_name)
             time.sleep(5)  # wait for model to unload
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -593,62 +647,65 @@ def test_dynamic_model_load_unload(self):
 
         # Model is removed so inference should fail
         try:
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
             self.assertTrue(
-                False,
-                "expected error for unavailable model " + savedmodel_name)
+                False, "expected error for unavailable model " + savedmodel_name
+            )
         except Exception as ex:
             self.assertIn(
-                "Request for unknown model: '{}' has no available versions".
-                format(savedmodel_name), ex.message())
+                "Request for unknown model: '{}' has no available versions".format(
+                    savedmodel_name
+                ),
+                ex.message(),
+            )
 
         # Add back the same model. The status/stats should be reset.
         try:
             shutil.copytree(savedmodel_name, "models/" + savedmodel_name)
             time.sleep(5)  # wait for model to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
 
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             stats = triton_client.get_inference_statistics(savedmodel_name)
             self.assertEqual(len(stats["model_stats"]), 2)
             self.assertEqual(stats["model_stats"][0]["name"], savedmodel_name)
             self.assertEqual(stats["model_stats"][1]["name"], savedmodel_name)
             self.assertEqual(
-                stats["model_stats"][0]["inference_stats"]["success"]["count"],
-                0)
+                stats["model_stats"][0]["inference_stats"]["success"]["count"], 0
+            )
             self.assertEqual(
-                stats["model_stats"][1]["inference_stats"]["success"]["count"],
-                0)
+                stats["model_stats"][1]["inference_stats"]["success"]["count"], 0
+            )
 
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             stats = triton_client.get_inference_statistics(savedmodel_name)
             self.assertEqual(len(stats.model_stats), 2)
             self.assertEqual(stats.model_stats[0].name, savedmodel_name)
             self.assertEqual(stats.model_stats[1].name, savedmodel_name)
-            self.assertEqual(stats.model_stats[0].inference_stats.success.count,
-                             0)
-            self.assertEqual(stats.model_stats[1].inference_stats.success.count,
-                             0)
+            self.assertEqual(stats.model_stats[0].inference_stats.success.count, 0)
+            self.assertEqual(stats.model_stats[1].inference_stats.success.count, 0)
 
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
@@ -658,16 +715,14 @@ def test_dynamic_model_load_unload(self):
         try:
             shutil.rmtree("models/" + onnx_name)
             time.sleep(5)  # wait for model to unload
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertTrue(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertFalse(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertFalse(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -675,41 +730,41 @@ def test_dynamic_model_load_unload(self):
 
         # Model is removed so inference should fail
         try:
-            iu.infer_exact(self,
-                           'onnx',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
-            self.assertTrue(False,
-                            "expected error for unavailable model " + onnx_name)
+            iu.infer_exact(
+                self,
+                "onnx",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
+            self.assertTrue(False, "expected error for unavailable model " + onnx_name)
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'onnx_float32_float32_float32' has no available versions",
-                ex.message())
+                ex.message(),
+            )
 
     def test_dynamic_model_load_unload_disabled(self):
         tensor_shape = (1, 16)
-        savedmodel_name = tu.get_model_name('savedmodel', np.float32,
-                                            np.float32, np.float32)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
+        savedmodel_name = tu.get_model_name(
+            "savedmodel", np.float32, np.float32, np.float32
+        )
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         # Make sure savedmodel model is not in the status (because
         # initially it is not in the model repository)
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
             except Exception as ex:
@@ -720,16 +775,14 @@ def test_dynamic_model_load_unload_disabled(self):
         try:
             shutil.copytree(savedmodel_name, "models/" + savedmodel_name)
             time.sleep(5)  # wait for model to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -737,37 +790,38 @@ def test_dynamic_model_load_unload_disabled(self):
 
         # Run inference which should fail because the model isn't there
         try:
-            iu.infer_exact(self,
-                           'savedmodel',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
+            iu.infer_exact(
+                self,
+                "savedmodel",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
             self.assertTrue(
-                False,
-                "expected error for unavailable model " + savedmodel_name)
+                False, "expected error for unavailable model " + savedmodel_name
+            )
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'savedmodel_float32_float32_float32' is not found",
-                ex.message())
+                ex.message(),
+            )
 
         # Remove one of the original models from the model repository.
         # Unloading is disabled so it should remain available in the status.
         try:
             shutil.rmtree("models/" + onnx_name)
             time.sleep(5)  # wait for model to unload (but it shouldn't)
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -776,84 +830,93 @@ def test_dynamic_model_load_unload_disabled(self):
         # Run inference to make sure model still being served even
         # though deleted from model repository
         try:
-            iu.infer_exact(self,
-                           'onnx',
-                           tensor_shape,
-                           1,
-                           np.float32,
-                           np.float32,
-                           np.float32,
-                           swap=True)
+            iu.infer_exact(
+                self,
+                "onnx",
+                tensor_shape,
+                1,
+                np.float32,
+                np.float32,
+                np.float32,
+                swap=True,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_dynamic_version_load_unload(self):
         tensor_shape = (1, 16)
-        graphdef_name = tu.get_model_name('graphdef', np.int32, np.int32,
-                                          np.int32)
+        graphdef_name = tu.get_model_name("graphdef", np.int32, np.int32, np.int32)
 
         # There are 3 versions. Make sure that all have status and are
         # ready.
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Run inference on version 1 to make sure it is available
         try:
-            iu.infer_exact(self,
-                           'graphdef',
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           swap=False,
-                           model_version=1)
+            iu.infer_exact(
+                self,
+                "graphdef",
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                swap=False,
+                model_version=1,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Make sure only version 1 has execution stats in the status.
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             stats = triton_client.get_inference_statistics(graphdef_name)
             self.assertEqual(len(stats["model_stats"]), 3)
             for idx in range(len(stats["model_stats"])):
-                self.assertEqual(stats["model_stats"][idx]["name"],
-                                 graphdef_name)
+                self.assertEqual(stats["model_stats"][idx]["name"], graphdef_name)
                 if stats["model_stats"][idx]["version"] == "1":
                     self.assertNotEqual(
-                        stats["model_stats"][idx]["inference_stats"]["success"]
-                        ["count"], 0)
+                        stats["model_stats"][idx]["inference_stats"]["success"][
+                            "count"
+                        ],
+                        0,
+                    )
                 else:
                     self.assertEqual(
-                        stats["model_stats"][idx]["inference_stats"]["success"]
-                        ["count"], 0)
-
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+                        stats["model_stats"][idx]["inference_stats"]["success"][
+                            "count"
+                        ],
+                        0,
+                    )
+
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             stats = triton_client.get_inference_statistics(graphdef_name)
             self.assertEqual(len(stats.model_stats), 3)
             for idx in range(len(stats.model_stats)):
                 self.assertEqual(stats.model_stats[idx].name, graphdef_name)
                 if stats.model_stats[idx].version == "1":
                     self.assertNotEqual(
-                        stats.model_stats[idx].inference_stats.success.count, 0)
+                        stats.model_stats[idx].inference_stats.success.count, 0
+                    )
                 else:
                     self.assertEqual(
-                        stats.model_stats[idx].inference_stats.success.count, 0)
+                        stats.model_stats[idx].inference_stats.success.count, 0
+                    )
 
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
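
The statistics checks above iterate over the per-version entries returned by
get_inference_statistics. A minimal standalone sketch of the same query, assuming a
server on localhost:8000 and a hypothetical model name:

    import tritonclient.http as httpclient

    model_name = "graphdef_int32_int32_int32"  # hypothetical; any loaded model works

    client = httpclient.InferenceServerClient("localhost:8000")
    stats = client.get_inference_statistics(model_name)

    # Each "model_stats" entry describes one loaded version; only versions that
    # actually served requests report a non-zero success count.
    for entry in stats["model_stats"]:
        count = entry["inference_stats"]["success"]["count"]
        print(entry["name"], entry["version"], count)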
@@ -863,87 +926,81 @@ def test_dynamic_version_load_unload(self):
         try:
             shutil.rmtree("models/" + graphdef_name + "/1")
             time.sleep(5)  # wait for version to unload
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Version is removed so inference should fail
         try:
-            iu.infer_exact(self,
-                           'graphdef',
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           swap=False,
-                           model_version=1)
+            iu.infer_exact(
+                self,
+                "graphdef",
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                swap=False,
+                model_version=1,
+            )
             self.assertTrue(
-                False, "expected error for unavailable model " + graphdef_name)
+                False, "expected error for unavailable model " + graphdef_name
+            )
         except Exception as ex:
             self.assertIn(
                 "Request for unknown model: 'graphdef_int32_int32_int32' version 1 is not at ready state",
-                ex.message())
+                ex.message(),
+            )
 
         # Add another version to the model repository.
         try:
-            shutil.copytree("models/" + graphdef_name + "/2",
-                            "models/" + graphdef_name + "/7")
+            shutil.copytree(
+                "models/" + graphdef_name + "/2", "models/" + graphdef_name + "/7"
+            )
             time.sleep(5)  # wait for version to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "7"))
+                self.assertFalse(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "7"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
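
In polling mode a new version is added simply by copying a version directory into the
repository and waiting for the next poll, which is what the test above exercises. A
condensed sketch of that flow, assuming a local models/ repository and the gRPC endpoint
used in these tests:

    import shutil
    import time

    import tritonclient.grpc as grpcclient

    model_name = "graphdef_int32_int32_int32"  # hypothetical model name

    # Copy an existing version directory to create a new version 7.
    shutil.copytree("models/" + model_name + "/2", "models/" + model_name + "/7")
    time.sleep(5)  # give the repository poller time to pick up the new version

    client = grpcclient.InferenceServerClient("localhost:8001")
    print(client.is_model_ready(model_name, "7"))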
 
     def test_dynamic_version_load_unload_disabled(self):
         tensor_shape = (1, 16)
-        graphdef_name = tu.get_model_name('graphdef', np.int32, np.int32,
-                                          np.int32)
+        graphdef_name = tu.get_model_name("graphdef", np.int32, np.int32, np.int32)
 
         # Add a new version to the model repository and give it time to
         # load. But it shouldn't load because dynamic loading is
         # disabled.
         try:
-            shutil.copytree("models/" + graphdef_name + "/2",
-                            "models/" + graphdef_name + "/7")
+            shutil.copytree(
+                "models/" + graphdef_name + "/2", "models/" + graphdef_name + "/7"
+            )
             time.sleep(5)  # wait for model to load
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
-                self.assertFalse(
-                    triton_client.is_model_ready(graphdef_name, "7"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(graphdef_name, "7"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -953,59 +1010,54 @@ def test_dynamic_version_load_unload_disabled(self):
         try:
             shutil.rmtree("models/" + graphdef_name + "/1")
             time.sleep(5)  # wait for version to unload (but it shouldn't)
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "1"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "2"))
-                self.assertTrue(triton_client.is_model_ready(
-                    graphdef_name, "3"))
-                self.assertFalse(
-                    triton_client.is_model_ready(graphdef_name, "7"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "1"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "2"))
+                self.assertTrue(triton_client.is_model_ready(graphdef_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(graphdef_name, "7"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Run inference to make sure the model is still being served even
         # though the version was deleted from the model repository
         try:
-            iu.infer_exact(self,
-                           'graphdef',
-                           tensor_shape,
-                           1,
-                           np.int32,
-                           np.int32,
-                           np.int32,
-                           swap=False,
-                           model_version=1)
+            iu.infer_exact(
+                self,
+                "graphdef",
+                tensor_shape,
+                1,
+                np.int32,
+                np.int32,
+                np.int32,
+                swap=False,
+                model_version=1,
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_dynamic_model_modify(self):
-        models_base = ('savedmodel', 'plan')
+        models_base = ("savedmodel", "plan")
         models_shape = ((1, 16), (1, 16))
         models = list()
         for m in models_base:
-            models.append(
-                tu.get_model_name(m, np.float32, np.float32, np.float32))
+            models.append(tu.get_model_name(m, np.float32, np.float32, np.float32))
 
         # Make sure savedmodel and plan are in the status
         for model_name in models:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1013,63 +1065,67 @@ def test_dynamic_model_modify(self):
         for version in (1, 3):
             for model_name, model_shape in zip(models_base, models_shape):
                 try:
-                    iu.infer_exact(self,
-                                   model_name,
-                                   model_shape,
-                                   1,
-                                   np.float32,
-                                   np.float32,
-                                   np.float32,
-                                   swap=(version == 3),
-                                   model_version=version)
+                    iu.infer_exact(
+                        self,
+                        model_name,
+                        model_shape,
+                        1,
+                        np.float32,
+                        np.float32,
+                        np.float32,
+                        swap=(version == 3),
+                        model_version=version,
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Change the model configuration to use wrong label file
         for base_name, model_name in zip(models_base, models):
-            shutil.copyfile("config.pbtxt.wrong." + base_name,
-                            "models/" + model_name + "/config.pbtxt")
+            shutil.copyfile(
+                "config.pbtxt.wrong." + base_name,
+                "models/" + model_name + "/config.pbtxt",
+            )
 
         time.sleep(5)  # wait for models to reload
         for model_name in models:
             for model_name, model_shape in zip(models_base, models_shape):
                 try:
-                    iu.infer_exact(self,
-                                   model_name,
-                                   model_shape,
-                                   1,
-                                   np.float32,
-                                   np.float32,
-                                   np.float32,
-                                   swap=(version == 3),
-                                   model_version=version,
-                                   output0_raw=False)
+                    iu.infer_exact(
+                        self,
+                        model_name,
+                        model_shape,
+                        1,
+                        np.float32,
+                        np.float32,
+                        np.float32,
+                        swap=(version == 3),
+                        model_version=version,
+                        output0_raw=False,
+                    )
                     self.assertTrue(
-                        False,
-                        "expected error for wrong label for " + model_name)
+                        False, "expected error for wrong label for " + model_name
+                    )
                 except AssertionError as ex:
-                    self.assertTrue("'label9" in str(ex) and "!=" in str(ex),
-                                    str(ex))
+                    self.assertTrue("'label9" in str(ex) and "!=" in str(ex), str(ex))
 
         # Change the model configuration to use the correct label file and to have
         # the default version policy (so that only version 3 is available).
         for base_name, model_name in zip(models_base, models):
-            shutil.copyfile("config.pbtxt." + base_name,
-                            "models/" + model_name + "/config.pbtxt")
+            shutil.copyfile(
+                "config.pbtxt." + base_name, "models/" + model_name + "/config.pbtxt"
+            )
 
         time.sleep(5)  # wait for models to reload
         for model_name in models:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1077,56 +1133,58 @@ def test_dynamic_model_modify(self):
         # change in model policy makes that no longer available.
         for model_name, model_shape in zip(models_base, models_shape):
             try:
-                iu.infer_exact(self,
-                               model_name,
-                               model_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=False,
-                               model_version=1)
+                iu.infer_exact(
+                    self,
+                    model_name,
+                    model_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=False,
+                    model_version=1,
+                )
                 self.assertTrue(
-                    False, "expected error for unavailable model " + model_name)
+                    False, "expected error for unavailable model " + model_name
+                )
             except Exception as ex:
                 self.assertIn("Request for unknown model", ex.message())
 
         # Version 3 should continue to work...
         for model_name, model_shape in zip(models_base, models_shape):
             try:
-                iu.infer_exact(self,
-                               model_name,
-                               model_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=True,
-                               model_version=3)
+                iu.infer_exact(
+                    self,
+                    model_name,
+                    model_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=True,
+                    model_version=3,
+                )
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_dynamic_file_delete(self):
-        models_base = ('savedmodel', 'plan')
+        models_base = ("savedmodel", "plan")
         models_shape = ((1, 16), (1, 16))
         models = list()
         for m in models_base:
-            models.append(
-                tu.get_model_name(m, np.float32, np.float32, np.float32))
+            models.append(tu.get_model_name(m, np.float32, np.float32, np.float32))
 
         # Make sure savedmodel and plan are in the status
         for model_name in models:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1134,15 +1192,17 @@ def test_dynamic_file_delete(self):
         for version in (1, 3):
             for model_name, model_shape in zip(models_base, models_shape):
                 try:
-                    iu.infer_exact(self,
-                                   model_name,
-                                   model_shape,
-                                   1,
-                                   np.float32,
-                                   np.float32,
-                                   np.float32,
-                                   swap=(version == 3),
-                                   model_version=version)
+                    iu.infer_exact(
+                        self,
+                        model_name,
+                        model_shape,
+                        1,
+                        np.float32,
+                        np.float32,
+                        np.float32,
+                        swap=(version == 3),
+                        model_version=version,
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1156,81 +1216,86 @@ def test_dynamic_file_delete(self):
         time.sleep(5)  # wait for models to reload
         for model_name in models:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Only version 3 (latest) should work...
         for model_name, model_shape in zip(models_base, models_shape):
             try:
-                iu.infer_exact(self,
-                               model_name,
-                               model_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=True,
-                               model_version=3)
+                iu.infer_exact(
+                    self,
+                    model_name,
+                    model_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=True,
+                    model_version=3,
+                )
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
             try:
-                iu.infer_exact(self,
-                               model_name,
-                               model_shape,
-                               1,
-                               np.float32,
-                               np.float32,
-                               np.float32,
-                               swap=False,
-                               model_version=1)
+                iu.infer_exact(
+                    self,
+                    model_name,
+                    model_shape,
+                    1,
+                    np.float32,
+                    np.float32,
+                    np.float32,
+                    swap=False,
+                    model_version=1,
+                )
                 self.assertTrue(
-                    False,
-                    "expected error for unavailable model " + graphdef_name)
+                    False, "expected error for unavailable model " + graphdef_name
+                )
             except Exception as ex:
                 self.assertIn("Request for unknown model", ex.message())
 
     def test_multiple_model_repository_polling(self):
         model_shape = (1, 16)
-        savedmodel_name = tu.get_model_name('savedmodel', np.float32,
-                                            np.float32, np.float32)
+        savedmodel_name = tu.get_model_name(
+            "savedmodel", np.float32, np.float32, np.float32
+        )
 
         # Models should be loaded successfully and infer
         # successfully. Initially savedmodel only has version 1.
-        self._infer_success_models([
-            'savedmodel',
-        ], (1,), model_shape)
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
+        self._infer_success_models(
+            [
+                "savedmodel",
+            ],
+            (1,),
+            model_shape,
+        )
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
 
         # Add the savedmodel to the second model repository, should cause
         # it to be unloaded due to duplication
         shutil.copytree(savedmodel_name, "models_0/" + savedmodel_name)
         time.sleep(5)  # wait for models to reload
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
 
         # Remove the savedmodel from the first model repository; the
         # model from the second model repository should be loaded
@@ -1238,87 +1303,96 @@ def test_multiple_model_repository_polling(self):
         # have versions 1 and 3.
         shutil.rmtree("models/" + savedmodel_name)
         time.sleep(5)  # wait for model to unload
-        self._infer_success_models(['savedmodel', 'graphdef', 'onnx'], (1, 3),
-                                   model_shape)
+        self._infer_success_models(
+            ["savedmodel", "graphdef", "onnx"], (1, 3), model_shape
+        )
 
     def test_multiple_model_repository_control(self):
         # similar to test_multiple_model_repository_polling, but the
         # model load/unload is controlled by the API
         model_shape = (1, 16)
-        savedmodel_name = tu.get_model_name('savedmodel', np.float32,
-                                            np.float32, np.float32)
-        model_bases = ['savedmodel', 'graphdef', 'onnx']
+        savedmodel_name = tu.get_model_name(
+            "savedmodel", np.float32, np.float32, np.float32
+        )
+        model_bases = ["savedmodel", "graphdef", "onnx"]
 
         # Initially models are not loaded
         for base in model_bases:
             try:
-                model_name = tu.get_model_name(base, np.float32, np.float32,
-                                               np.float32)
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                model_name = tu.get_model_name(base, np.float32, np.float32, np.float32)
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Load all models, here we use GRPC
         for base in model_bases:
             try:
-                model_name = tu.get_model_name(base, np.float32, np.float32,
-                                               np.float32)
+                model_name = tu.get_model_name(base, np.float32, np.float32, np.float32)
                 triton_client = grpcclient.InferenceServerClient(
-                    "localhost:8001", verbose=True)
+                    "localhost:8001", verbose=True
+                )
                 triton_client.load_model(model_name)
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Models should be loaded successfully and infer
         # successfully. Initially savedmodel only has version 1.
-        self._infer_success_models([
-            'savedmodel',
-        ], (1,), model_shape)
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
+        self._infer_success_models(
+            [
+                "savedmodel",
+            ],
+            (1,),
+            model_shape,
+        )
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
 
         # Add the savedmodel to the second model repository. Because the
         # server is not polling, this doesn't change any model state; all
         # models are still loaded and available.
         shutil.copytree(savedmodel_name, "models_0/" + savedmodel_name)
-        self._infer_success_models([
-            'savedmodel',
-        ], (1,), model_shape)
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
-
-        # Reload savedmodel which will cause it to unload because it
-        # is in 2 model repositories. Use HTTP here.
-        try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+        self._infer_success_models(
+            [
+                "savedmodel",
+            ],
+            (1,),
+            model_shape,
+        )
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
+
+        # Load savedmodel again which should fail because it is now duplicated
+        # in 2 model repositories. Use HTTP here.
+        try:
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(savedmodel_name)
         except Exception as ex:
-            self.assertIn("failed to load '{}'".format(savedmodel_name),
-                          ex.message())
+            self.assertIn("failed to load '{}'".format(savedmodel_name), ex.message())
 
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(savedmodel_name, "3"))
+                # Unlike polling mode, the failed load on the duplicate model
+                # should NOT unload the existing versions in model control mode.
+                self.assertTrue(triton_client.is_model_ready(savedmodel_name, "1"))
+                # Version 3 did not exist in the first model repository, so
+                # it should still not be loaded.
+                self.assertFalse(triton_client.is_model_ready(savedmodel_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models(['graphdef', 'onnx'], (1, 3), model_shape)
+        self._infer_success_models(["graphdef", "onnx"], (1, 3), model_shape)
 
         # Remove the savedmodel from the first model repository and
         # explicitly load savedmodel. The savedmodel from the second
@@ -1326,20 +1400,23 @@ def test_multiple_model_repository_control(self):
         # model repository savedmodel should have versions 1 and 3.
         shutil.rmtree("models/" + savedmodel_name)
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
+            # Unload existing in-memory model from first model repository
+            triton_client.unload_model(savedmodel_name)
+            # Load model from second model repository since original was deleted
             triton_client.load_model(savedmodel_name)
         except Exception as ex:
-            self.assertIn("failed to load '{}'".format(savedmodel_name),
-                          ex.message())
+            self.assertIn("failed to load '{}'".format(savedmodel_name), ex.message())
 
-        self._infer_success_models(['savedmodel', 'graphdef', 'onnx'], (1, 3),
-                                   model_shape)
+        self._infer_success_models(
+            ["savedmodel", "graphdef", "onnx"], (1, 3), model_shape
+        )
 
     def test_model_control(self):
         model_shape = (1, 16)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         ensemble_prefix = "simple_"
         ensemble_name = ensemble_prefix + onnx_name
@@ -1347,48 +1424,55 @@ def test_model_control(self):
         # Make sure no models are loaded
         for model_name in (onnx_name, ensemble_name):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Load non-existent model
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 triton_client.load_model("unknown_model")
                 self.assertTrue(False, "expected unknown model failure")
             except Exception as ex:
                 self.assertIn(
-                    "failed to load 'unknown_model', no version is available",
-                    ex.message())
+                    "failed to load 'unknown_model', failed to poll from model repository",
+                    ex.message(),
+                )
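
Load failures are surfaced to the client as InferenceServerException, and the tests
match on ex.message(). A minimal sketch of that pattern, assuming the HTTP client and a
hypothetical model name:

    import tritonclient.http as httpclient
    from tritonclient.utils import InferenceServerException

    client = httpclient.InferenceServerClient("localhost:8000")
    try:
        client.load_model("unknown_model")  # hypothetical name, not in any repository
    except InferenceServerException as ex:
        # The server explains why the load failed, e.g. that the model could not
        # be polled from the model repository.
        print(ex.message())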
 
         # Load the ensemble model; the dependent model should be polled and loaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(ensemble_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Delete model configuration for onnx, which will cause
         # the autofiller to use the latest version policy so that only
@@ -1396,51 +1480,65 @@ def test_model_control(self):
         for model_name in (onnx_name,):
             os.remove("models/" + model_name + "/config.pbtxt")
 
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Reload models, only version 3 should be available for onnx
         for model_name in (onnx_name, ensemble_name):
             try:
                 triton_client = grpcclient.InferenceServerClient(
-                    "localhost:8001", verbose=True)
+                    "localhost:8001", verbose=True
+                )
                 triton_client.load_model(model_name)
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (3,), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (3,),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         for model_name in (onnx_name,):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Unload a non-existent model, nothing should happen
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 triton_client.unload_model("unknown_model")
             except Exception as ex:
@@ -1449,24 +1547,23 @@ def test_model_control(self):
         # Unload the dependent model; as a side effect, the ensemble model will
         # be forced to be unloaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         for model_name in (onnx_name, ensemble_name):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -1474,41 +1571,43 @@ def test_model_control(self):
         # model. The ensemble model should not be reloaded because it
         # was explicitly unloaded.
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(ensemble_name)
             triton_client.load_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (3,), model_shape)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (3,),
+            model_shape,
+        )
 
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(ensemble_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(ensemble_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(ensemble_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(ensemble_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_model_control_fail(self):
-        model_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                       np.float32)
+        model_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         # Make sure no models are loaded
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
                 self.assertFalse(triton_client.is_model_ready(model_name, "1"))
@@ -1518,28 +1617,27 @@ def test_model_control_fail(self):
 
         # Request to load the model and expect fail to load
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(model_name)
             self.assertTrue(False, "expecting load failure")
         except InferenceServerException as ex:
-            self.assertIn("load failed for model '{}'".format(model_name),
-                          ex.message())
+            self.assertIn("load failed for model '{}'".format(model_name), ex.message())
 
         # Another attempt should fail as well
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(model_name)
             self.assertTrue(False, "expecting load failure")
         except InferenceServerException as ex:
-            self.assertIn("load failed for model '{}'".format(model_name),
-                          ex.message())
+            self.assertIn("load failed for model '{}'".format(model_name), ex.message())
 
     def test_model_control_ensemble(self):
         model_shape = (1, 16)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
 
         ensemble_prefix = "simple_"
         ensemble_name = ensemble_prefix + onnx_name
@@ -1547,83 +1645,91 @@ def test_model_control_ensemble(self):
         # Make sure no models are loaded
         for model_name in (onnx_name, ensemble_name):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Load the ensemble model; the dependent model should be polled and loaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(ensemble_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Unload the ensemble with the unload_dependents flag. All models should be unloaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(ensemble_name, unload_dependents=True)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
         for model_name in (onnx_name, ensemble_name):
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
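
The unload_dependents flag controls whether the composing models of an ensemble are
unloaded together with it, which is what the assertions above and below verify. A short
sketch of both calls, assuming explicit model control is enabled and a hypothetical
ensemble name:

    import tritonclient.http as httpclient

    ensemble_name = "simple_onnx_float32_float32_float32"  # hypothetical ensemble name

    client = httpclient.InferenceServerClient("localhost:8000")

    client.load_model(ensemble_name)  # composing models are loaded as dependencies
    client.unload_model(ensemble_name, unload_dependents=True)  # unloads them as well

    client.load_model(ensemble_name)
    client.unload_model(ensemble_name)  # default: composing models remain loaded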
 
         # Load the ensemble model, and unload it without the unload_dependents flag
         # (the default). The dependent model should still be available.
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(ensemble_name)
             triton_client.unload_model(ensemble_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
 
         try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(ensemble_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(ensemble_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(ensemble_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(ensemble_name, "3"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "1"))
                 self.assertTrue(triton_client.is_model_ready(onnx_name, "3"))
         except Exception as ex:
@@ -1631,8 +1737,7 @@ def test_model_control_ensemble(self):
 
     def test_load_same_model_different_platform(self):
         model_shape = (1, 16)
-        model_name = tu.get_model_name('simple', np.float32, np.float32,
-                                       np.float32)
+        model_name = tu.get_model_name("simple", np.float32, np.float32, np.float32)
 
         # Check whether or not to use grpc protocol
         use_grpc = "TRITONSERVER_USE_GRPC" in os.environ
@@ -1646,19 +1751,22 @@ def test_load_same_model_different_platform(self):
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
             self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             if use_grpc:
-                metadata = triton_client.get_model_metadata(model_name,
-                                                            as_json=True)
+                metadata = triton_client.get_model_metadata(model_name, as_json=True)
             else:
                 metadata = triton_client.get_model_metadata(model_name)
             self.assertEqual(metadata["platform"], "tensorrt_plan")
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
-        self._infer_success_models([
-            "simple",
-        ], (
-            1,
-            3,
-        ), model_shape)
+        self._infer_success_models(
+            [
+                "simple",
+            ],
+            (
+                1,
+                3,
+            ),
+            model_shape,
+        )
 
         # Copy the same model of different platform to model repository
         shutil.rmtree("models/" + model_name)
@@ -1680,19 +1788,22 @@ def test_load_same_model_different_platform(self):
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
             self.assertTrue(triton_client.is_model_ready(model_name, "3"))
             if use_grpc:
-                metadata = triton_client.get_model_metadata(model_name,
-                                                            as_json=True)
+                metadata = triton_client.get_model_metadata(model_name, as_json=True)
             else:
                 metadata = triton_client.get_model_metadata(model_name)
             self.assertEqual(metadata["platform"], "pytorch_libtorch")
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
-        self._infer_success_models([
-            "simple",
-        ], (
-            1,
-            3,
-        ), model_shape)
+        self._infer_success_models(
+            [
+                "simple",
+            ],
+            (
+                1,
+                3,
+            ),
+            model_shape,
+        )
 
     def test_model_availability_on_reload(self):
         model_name = "identity_zero_1_int32"
@@ -1717,9 +1828,8 @@ def test_model_availability_on_reload(self):
 
         # Reload models, v1 should still be available until v2 is loaded
         # The load is requested in another thread as it is a blocking API,
-        # and the v1 availibility should be tested during the reload
-        thread = threading.Thread(target=self._async_load,
-                                  args=(model_name, use_grpc))
+        # and the v1 availability should be tested during the reload
+        thread = threading.Thread(target=self._async_load, args=(model_name, use_grpc))
         thread.start()
         # wait for time < model creation delay to ensure load request is sent
         time.sleep(3)
@@ -1730,9 +1840,12 @@ def test_model_availability_on_reload(self):
             triton_client = self._get_client(use_grpc)
             self.assertTrue(triton_client.is_server_live())
             load_end = time.time()
-            self.assertTrue((load_end - load_start) < 5,
-                            "server was waiting unexpectly, waited {}".format(
-                                (load_end - load_start)))
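+            # The liveness check above was issued while the load is still in
+            # progress; it should return quickly rather than wait for the load,
+            # hence the 5 second bound below.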
+            self.assertTrue(
+                (load_end - load_start) < 5,
+                "server was waiting unexpectedly, waited {}".format(
+                    (load_end - load_start)
+                ),
+            )
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
         except Exception as ex:
@@ -1770,14 +1883,12 @@ def test_model_availability_on_reload_2(self):
         self._infer_success_identity(model_base, (1,), np.int32, model_shape)
 
         # Overwrite config.pbtxt to load v2 only
-        shutil.copyfile("config.pbtxt.v2",
-                        "models/" + model_name + "/config.pbtxt")
+        shutil.copyfile("config.pbtxt.v2", "models/" + model_name + "/config.pbtxt")
 
         # Reload models, v1 should still be available until v2 is loaded
         # The load is requested in another thread as it is a blocking API,
-        # and the v1 availibility should be tested during the reload
-        thread = threading.Thread(target=self._async_load,
-                                  args=(model_name, use_grpc))
+        # and the v1 availability should be tested during the reload
+        thread = threading.Thread(target=self._async_load, args=(model_name, use_grpc))
         thread.start()
         # wait for time < model creation delay to ensure load request is sent
         time.sleep(3)
@@ -1788,9 +1899,12 @@ def test_model_availability_on_reload_2(self):
             triton_client = self._get_client(use_grpc)
             self.assertTrue(triton_client.is_server_live())
             load_end = time.time()
-            self.assertTrue((load_end - load_start) < 5,
-                            "server was waiting unexpectly, waited {}".format(
-                                (load_end - load_start)))
+            self.assertTrue(
+                (load_end - load_start) < 5,
+                "server was waiting unexpectedly, waited {}".format(
+                    (load_end - load_start)
+                ),
+            )
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
         except Exception as ex:
@@ -1828,13 +1942,11 @@ def test_model_availability_on_reload_3(self):
         self._infer_success_identity(model_base, (1,), np.int32, model_shape)
 
         # Overwrite config.pbtxt to load v2 only
-        shutil.copyfile("config.pbtxt.new",
-                        "models/" + model_name + "/config.pbtxt")
+        shutil.copyfile("config.pbtxt.new", "models/" + model_name + "/config.pbtxt")
 
         # Reload models, v1 will be reloaded but it should be available
         # during the whole reload
-        thread = threading.Thread(target=self._async_load,
-                                  args=(model_name, use_grpc))
+        thread = threading.Thread(target=self._async_load, args=(model_name, use_grpc))
         thread.start()
         # wait for time < model creation delay to ensure load request is sent
         time.sleep(3)
@@ -1845,9 +1957,12 @@ def test_model_availability_on_reload_3(self):
             triton_client = self._get_client(use_grpc)
             self.assertTrue(triton_client.is_server_live())
             load_end = time.time()
-            self.assertTrue((load_end - load_start) < 5,
-                            "server was waiting unexpectly, waited {}".format(
-                                (load_end - load_start)))
+            self.assertTrue(
+                (load_end - load_start) < 5,
+                "server was waiting unexpectedly, waited {}".format(
+                    (load_end - load_start)
+                ),
+            )
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
         except Exception as ex:
@@ -1872,8 +1987,9 @@ def test_model_reload_fail(self):
 
         # Make sure version 1 of the model is loaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
@@ -1882,23 +1998,26 @@ def test_model_reload_fail(self):
         self._infer_success_identity(model_base, (1,), np.int32, model_shape)
 
         # Overwrite config.pbtxt to load v2 only on GPU, which will fail
-        shutil.copyfile("config.pbtxt.v2.gpu",
-                        "models/" + model_name + "/config.pbtxt")
+        shutil.copyfile("config.pbtxt.v2.gpu", "models/" + model_name + "/config.pbtxt")
 
         # Reload models, v1 should still be available even if v2 fails to load
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(model_name)
             self.assertTrue(False, "expecting load failure")
         except Exception as ex:
-            self.assertIn("version 2: Internal: GPU instances not supported",
-                          ex.message())
+            self.assertIn(
+                "version 2 is at UNAVAILABLE state: Internal: GPU instances not supported",
+                ex.message(),
+            )
 
         # Make sure version 1 of the model is available, and version 2 is not
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             self.assertTrue(triton_client.is_server_live())
             self.assertTrue(triton_client.is_server_ready())
             self.assertTrue(triton_client.is_model_ready(model_name, "1"))
@@ -1909,113 +2028,143 @@ def test_model_reload_fail(self):
 
     def test_multiple_model_repository_control_startup_models(self):
         model_shape = (1, 16)
-        onnx_name = tu.get_model_name('onnx', np.float32, np.float32,
-                                      np.float32)
-        plan_name = tu.get_model_name('plan', np.float32, np.float32,
-                                      np.float32)
+        onnx_name = tu.get_model_name("onnx", np.float32, np.float32, np.float32)
+        plan_name = tu.get_model_name("plan", np.float32, np.float32, np.float32)
 
         ensemble_prefix = "simple_"
         onnx_ensemble_name = ensemble_prefix + onnx_name
         plan_ensemble_name = ensemble_prefix + plan_name
 
         # Make sure unloaded models are not in the status
-        for base in ('savedmodel',):
-            model_name = tu.get_model_name(base, np.float32, np.float32,
-                                           np.float32)
+        for base in ("savedmodel",):
+            model_name = tu.get_model_name(base, np.float32, np.float32, np.float32)
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
         # And loaded models work properly
-        self._infer_success_models([
-            "onnx",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
-        self._infer_success_models([
-            "plan",
-        ], (1, 3), model_shape)
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
+        self._infer_success_models(
+            [
+                "plan",
+            ],
+            (1, 3),
+            model_shape,
+        )
 
         # Load non-existing model
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 triton_client.load_model("unknown_model")
                 self.assertTrue(False, "expected unknown model failure")
             except Exception as ex:
                 self.assertIn(
-                    "failed to load 'unknown_model', no version is available",
-                    ex.message())
+                    "failed to load 'unknown_model', failed to poll from model repository",
+                    ex.message(),
+                )
 
         # Load plan ensemble model, the dependent model is already
         # loaded via command-line
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.load_model(plan_ensemble_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "plan",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_plan",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "plan",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_plan",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Delete model configuration, which will cause the autofiller
         # to use the latest version policy so that only version 3 will
         # be available if the models are re-loaded
         os.remove("models/" + onnx_name + "/config.pbtxt")
 
-        self._infer_success_models([
-            "plan",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_plan",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
+        self._infer_success_models(
+            [
+                "plan",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_plan",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
 
         # Reload onnx, only version 3 should be available
         try:
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             triton_client.load_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (3,), model_shape)
-        self._infer_success_models([
-            "simple_onnx",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
-
-        try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (3,),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_onnx",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
+
+        try:
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
                 self.assertFalse(triton_client.is_model_ready(onnx_name, "1"))
@@ -2023,10 +2172,10 @@ def test_multiple_model_repository_control_startup_models(self):
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Unload non-existing model, nothing should happen
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
             try:
                 triton_client.unload_model("unknown_model")
             except Exception as ex:
@@ -2035,24 +2184,23 @@ def test_multiple_model_repository_control_startup_models(self):
         # Unload the onnx model; as a side effect, the ensemble model
         # will be forced to be unloaded
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         for model_name in [onnx_name, onnx_ensemble_name]:
             try:
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
             except Exception as ex:
                 self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2060,36 +2208,46 @@ def test_multiple_model_repository_control_startup_models(self):
         # depending model. The ensemble model should not be reloaded
         # because it was explicitly unloaded.
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
             triton_client.unload_model(onnx_ensemble_name)
             triton_client.load_model(onnx_name)
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
-        self._infer_success_models([
-            "onnx",
-        ], (3,), model_shape)
-        self._infer_success_models([
-            "plan",
-        ], (1, 3), model_shape)
-        self._infer_success_models([
-            "simple_plan",
-        ], (1, 3),
-                                   model_shape,
-                                   swap=True)
-
-        try:
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+        self._infer_success_models(
+            [
+                "onnx",
+            ],
+            (3,),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "plan",
+            ],
+            (1, 3),
+            model_shape,
+        )
+        self._infer_success_models(
+            [
+                "simple_plan",
+            ],
+            (1, 3),
+            model_shape,
+            swap=True,
+        )
+
+        try:
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 self.assertTrue(triton_client.is_server_live())
                 self.assertTrue(triton_client.is_server_ready())
-                self.assertFalse(
-                    triton_client.is_model_ready(onnx_ensemble_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(onnx_ensemble_name, "3"))
+                self.assertFalse(triton_client.is_model_ready(onnx_ensemble_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(onnx_ensemble_name, "3"))
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2097,7 +2255,7 @@ def test_model_repository_index(self):
         # use model control EXPLICIT and --load-model to load a subset of models
         # in model repository
         tensor_shape = (1, 16)
-        model_bases = ['graphdef', 'savedmodel', "simple_savedmodel"]
+        model_bases = ["graphdef", "savedmodel", "simple_savedmodel"]
 
         # Sanity check on loaded models
         # 3 models should be loaded:
@@ -2106,12 +2264,13 @@ def test_model_repository_index(self):
         #     graphdef_float32_float32_float32
         for model_base in model_bases:
             try:
-                model_name = tu.get_model_name(model_base, np.float32,
-                                               np.float32, np.float32)
-                for triton_client in (httpclient.InferenceServerClient(
-                        "localhost:8000", verbose=True),
-                                      grpcclient.InferenceServerClient(
-                                          "localhost:8001", verbose=True)):
+                model_name = tu.get_model_name(
+                    model_base, np.float32, np.float32, np.float32
+                )
+                for triton_client in (
+                    httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                    grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+                ):
                     self.assertTrue(triton_client.is_server_live())
                     self.assertTrue(triton_client.is_server_ready())
                     self.assertTrue(triton_client.is_model_ready(model_name))
@@ -2123,8 +2282,9 @@ def test_model_repository_index(self):
         # which appears in two repositories.
         model_bases.append("simple_graphdef")
         try:
-            triton_client = httpclient.InferenceServerClient("localhost:8000",
-                                                             verbose=True)
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
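+            # The repository index reports every model in the configured
+            # repositories, loaded or not, which is why 8 entries are expected
+            # even though only a subset was loaded at startup.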
             index = triton_client.get_model_repository_index()
             indexed = list()
             self.assertEqual(len(index), 8)
@@ -2133,15 +2293,17 @@ def test_model_repository_index(self):
                 if i["name"] == "onnx_float32_float32_float32":
                     self.assertEqual(i["state"], "UNAVAILABLE")
                     self.assertEqual(
-                        i["reason"],
-                        "model appears in two or more repositories")
+                        i["reason"], "model appears in two or more repositories"
+                    )
             for model_base in model_bases:
-                model_name = tu.get_model_name(model_base, np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    model_base, np.float32, np.float32, np.float32
+                )
                 self.assertTrue(model_name in indexed)
 
-            triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                             verbose=True)
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
             index = triton_client.get_model_repository_index()
             indexed = list()
             self.assertEqual(len(index.models), 8)
@@ -2150,10 +2312,12 @@ def test_model_repository_index(self):
                 if i.name == "onnx_float32_float32_float32":
                     self.assertEqual(i.state, "UNAVAILABLE")
                     self.assertEqual(
-                        i.reason, "model appears in two or more repositories")
+                        i.reason, "model appears in two or more repositories"
+                    )
             for model_base in model_bases:
-                model_name = tu.get_model_name(model_base, np.float32,
-                                               np.float32, np.float32)
+                model_name = tu.get_model_name(
+                    model_base, np.float32, np.float32, np.float32
+                )
                 self.assertTrue(model_name in indexed)
 
         except Exception as ex:
@@ -2162,21 +2326,19 @@ def test_model_repository_index(self):
     def test_config_override(self):
         model_shape = (1, 16)
 
-        for triton_client in (httpclient.InferenceServerClient("localhost:8000",
-                                                               verbose=True),
-                              grpcclient.InferenceServerClient("localhost:8001",
-                                                               verbose=True)):
-            for base in (('onnx', 'onnxruntime'),):
-                model_name = tu.get_model_name(base[0], np.float32, np.float32,
-                                               np.float32)
+        for triton_client in (
+            httpclient.InferenceServerClient("localhost:8000", verbose=True),
+            grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+        ):
+            for base in (("onnx", "onnxruntime"),):
+                model_name = tu.get_model_name(
+                    base[0], np.float32, np.float32, np.float32
+                )
                 try:
                     self.assertTrue(triton_client.is_server_live())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "2"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "2"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2185,18 +2347,23 @@ def test_config_override(self):
                 try:
                     triton_client.load_model(model_name)
                     self.assertTrue(
-                        False, "expected fail to load '{}'".format(model_name))
+                        False, "expected fail to load '{}'".format(model_name)
+                    )
                 except Exception as ex:
                     self.assertIn(
-                        "load failed for model '{}'".format(model_name),
-                        ex.message())
+                        "load failed for model '{}'".format(model_name), ex.message()
+                    )
 
                 # Request to load the model with provided "correct" config
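+                # The config below pins version_policy to version 2 only, so only
+                # version 2 is expected to be ready after the load succeeds.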
                 try:
-                    triton_client.load_model(model_name,
-                                             config="""
+                    triton_client.load_model(
+                        model_name,
+                        config="""
 {{"backend":"{backend}","version_policy":{{"specific" : {{ "versions": [2] }} }} }}
-""".format(backend=base[1]))
+""".format(
+                            backend=base[1]
+                        ),
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
                 self.assertFalse(triton_client.is_model_ready(model_name, "1"))
@@ -2204,68 +2371,61 @@ def test_config_override(self):
                 self.assertFalse(triton_client.is_model_ready(model_name, "3"))
 
                 # And loaded models work properly
-                self._infer_success_models([
-                    base[0],
-                ], (2,), model_shape)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (2,),
+                    model_shape,
+                )
 
                 # A request without additional config will load with the default
                 # config and is expected to fail; version 2 will not be unloaded.
                 try:
                     triton_client.load_model(model_name)
                     self.assertTrue(
-                        False, "expected fail to load '{}'".format(model_name))
+                        False, "expected fail to load '{}'".format(model_name)
+                    )
                 except Exception as ex:
                     self.assertIn(
-                        "load failed for model '{}'".format(model_name),
-                        ex.message())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "2"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                        "load failed for model '{}'".format(model_name), ex.message()
+                    )
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "2"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
 
                 # Unload model for the next client iteration
                 try:
                     triton_client.unload_model(model_name)
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "2"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "2"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "3"))
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
     def test_file_override(self):
-        import base64
-
         model_shape = (1, 16)
         override_base = "override_model"
 
-        for base in (('onnx', 'onnxruntime'),):
-            model_name = tu.get_model_name(base[0], np.float32, np.float32,
-                                           np.float32)
-            override_model_name = tu.get_model_name(override_base, np.float32,
-                                                    np.float32, np.float32)
+        for base in (("onnx", "onnxruntime"),):
+            model_name = tu.get_model_name(base[0], np.float32, np.float32, np.float32)
+            override_model_name = tu.get_model_name(
+                override_base, np.float32, np.float32, np.float32
+            )
 
             # Prepare override file
-            with open("models/{}/3/model.{}".format(model_name, base[0]),
-                      'rb') as f:
+            with open("models/{}/3/model.{}".format(model_name, base[0]), "rb") as f:
                 file_content = f.read()
 
-            for triton_client in (httpclient.InferenceServerClient(
-                    "localhost:8000", verbose=True),
-                                  grpcclient.InferenceServerClient(
-                                      "localhost:8001", verbose=True)):
+            for triton_client in (
+                httpclient.InferenceServerClient("localhost:8000", verbose=True),
+                grpcclient.InferenceServerClient("localhost:8001", verbose=True),
+            ):
                 try:
                     self.assertTrue(triton_client.is_server_live())
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "1"))
-                    self.assertFalse(
-                        triton_client.is_model_ready(model_name, "2"))
-                    self.assertTrue(
-                        triton_client.is_model_ready(model_name, "3"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "1"))
+                    self.assertFalse(triton_client.is_model_ready(model_name, "2"))
+                    self.assertTrue(triton_client.is_model_ready(model_name, "3"))
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2275,14 +2435,17 @@ def test_file_override(self):
                 # not be used.
                 try:
                     triton_client.load_model(
-                        model_name, files={"file:1/model.onnx": file_content})
-                    self.assertTrue(
-                        False, "expected error on missing override config")
+                        model_name, files={"file:1/model.onnx": file_content}
+                    )
+                    self.assertTrue(False, "expected error on missing override config")
                 except InferenceServerException as ex:
                     # [FIXME] Improve error reporting to mention missing config
                     self.assertIn(
-                        "failed to load '{}', failed to poll from model repository"
-                        .format(model_name), ex.message())
+                        "failed to load '{}', failed to poll from model repository".format(
+                            model_name
+                        ),
+                        ex.message(),
+                    )
 
                 # Sanity check that the previously loaded version is still available
                 # after the failed attempt to load the model with a different version
@@ -2290,18 +2453,22 @@ def test_file_override(self):
                 self.assertFalse(triton_client.is_model_ready(model_name, "2"))
                 self.assertTrue(triton_client.is_model_ready(model_name, "3"))
 
-                self._infer_success_models([
-                    base[0],
-                ], (3,), model_shape)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (3,),
+                    model_shape,
+                )
 
                 # Request to load the model with override file and config in
                 # a different name
                 try:
                     triton_client.load_model(
                         override_model_name,
-                        config="""{{"backend":"{backend}" }}""".format(
-                            backend=base[1]),
-                        files={"file:1/model.onnx": file_content})
+                        config="""{{"backend":"{backend}" }}""".format(backend=base[1]),
+                        files={"file:1/model.onnx": file_content},
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2310,31 +2477,35 @@ def test_file_override(self):
                 self.assertFalse(triton_client.is_model_ready(model_name, "1"))
                 self.assertFalse(triton_client.is_model_ready(model_name, "2"))
                 self.assertTrue(triton_client.is_model_ready(model_name, "3"))
-                self._infer_success_models([
-                    base[0],
-                ], (3,), model_shape)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (3,),
+                    model_shape,
+                )
 
                 # New override model should also be available
-                self.assertTrue(
-                    triton_client.is_model_ready(override_model_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(override_model_name, "2"))
-                self.assertFalse(
-                    triton_client.is_model_ready(override_model_name, "3"))
-                self._infer_success_models([
-                    override_base,
-                ], (1,),
-                                           model_shape,
-                                           swap=True)
+                self.assertTrue(triton_client.is_model_ready(override_model_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(override_model_name, "2"))
+                self.assertFalse(triton_client.is_model_ready(override_model_name, "3"))
+                self._infer_success_models(
+                    [
+                        override_base,
+                    ],
+                    (1,),
+                    model_shape,
+                    swap=True,
+                )
 
                 # Request to load the model with override file and config in
                 # original name
                 try:
                     triton_client.load_model(
                         model_name,
-                        config="""{{"backend":"{backend}" }}""".format(
-                            backend=base[1]),
-                        files={"file:1/model.onnx": file_content})
+                        config="""{{"backend":"{backend}" }}""".format(backend=base[1]),
+                        files={"file:1/model.onnx": file_content},
+                    )
                 except Exception as ex:
                     self.assertTrue(False, "unexpected error {}".format(ex))
 
@@ -2343,24 +2514,27 @@ def test_file_override(self):
                 self.assertTrue(triton_client.is_model_ready(model_name, "1"))
                 self.assertFalse(triton_client.is_model_ready(model_name, "2"))
                 self.assertFalse(triton_client.is_model_ready(model_name, "3"))
-                self._infer_success_models([
-                    base[0],
-                ], (1,),
-                                           model_shape,
-                                           swap=True)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (1,),
+                    model_shape,
+                    swap=True,
+                )
 
                 # The model with different name should be available
-                self.assertTrue(
-                    triton_client.is_model_ready(override_model_name, "1"))
-                self.assertFalse(
-                    triton_client.is_model_ready(override_model_name, "2"))
-                self.assertFalse(
-                    triton_client.is_model_ready(override_model_name, "3"))
-                self._infer_success_models([
-                    override_base,
-                ], (1,),
-                                           model_shape,
-                                           swap=True)
+                self.assertTrue(triton_client.is_model_ready(override_model_name, "1"))
+                self.assertFalse(triton_client.is_model_ready(override_model_name, "2"))
+                self.assertFalse(triton_client.is_model_ready(override_model_name, "3"))
+                self._infer_success_models(
+                    [
+                        override_base,
+                    ],
+                    (1,),
+                    model_shape,
+                    swap=True,
+                )
 
                 # Reset model for the next client iteration
                 try:
@@ -2373,19 +2547,99 @@ def test_file_override(self):
                 self.assertFalse(triton_client.is_model_ready(model_name, "1"))
                 self.assertFalse(triton_client.is_model_ready(model_name, "2"))
                 self.assertTrue(triton_client.is_model_ready(model_name, "3"))
-                self._infer_success_models([
-                    base[0],
-                ], (3,), model_shape)
+                self._infer_success_models(
+                    [
+                        base[0],
+                    ],
+                    (3,),
+                    model_shape,
+                )
+
+    # Test that model load API file override can't be used to create files
+    # outside of any model directory.
+    def test_file_override_security(self):
+        # When using model load API, temporary model directories are created in
+        # a randomly generated /tmp/folderXXXXXX directory for the life of the
+        # model, and cleaned up on model unload.
+        model_basepath = "/tmp/folderXXXXXX"
+        if os.path.exists(model_basepath) and os.path.isdir(model_basepath):
+            shutil.rmtree(model_basepath)
+        os.makedirs(model_basepath)
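+        # NOTE: the fixed "/tmp/folderXXXXXX" path above is only a local stand-in
+        # for the randomly named per-model directory the server creates; it is
+        # used here to construct and verify the escape paths checked below.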
+
+        # Set file override paths that try to escape out of model directory,
+        # and test both pre-existing and non-existent files.
+        root_home_dir = "/root"
+
+        # Relative paths
+        escape_dir_rel = os.path.join("..", "..", "root")
+        escape_dir_full = os.path.join(model_basepath, escape_dir_rel)
+        self.assertEqual(os.path.abspath(escape_dir_full), root_home_dir)
+
+        new_file_rel = os.path.join(escape_dir_rel, "new_dir", "test.txt")
+        self.assertFalse(os.path.exists(os.path.join(model_basepath, new_file_rel)))
+        existing_file_rel = os.path.join(escape_dir_rel, ".bashrc")
+        self.assertTrue(os.path.exists(os.path.join(model_basepath, existing_file_rel)))
+
+        # Symlinks
+        ## No easy way to inject a symlink into the generated temp model dir, so
+        ## for testing's sake, make a fixed symlink path in /tmp.
+        escape_dir_symlink_rel = os.path.join("..", "escape_symlink")
+        escape_dir_symlink_full = "/tmp/escape_symlink"
+        self.assertEqual(
+            os.path.abspath(os.path.join(model_basepath, escape_dir_symlink_rel)),
+            escape_dir_symlink_full,
+        )
+        if os.path.exists(escape_dir_symlink_full):
+            os.unlink(escape_dir_symlink_full)
+        os.symlink(root_home_dir, escape_dir_symlink_full)
+        # The symlink should resolve to the root home directory
+        self.assertEqual(os.path.realpath(escape_dir_symlink_full), root_home_dir)
+
+        symlink_new_file_rel = os.path.join(
+            escape_dir_symlink_rel, "new_dir", "test.txt"
+        )
+        self.assertFalse(
+            os.path.exists(os.path.join(model_basepath, symlink_new_file_rel))
+        )
+        symlink_existing_file_rel = os.path.join(escape_dir_symlink_rel, ".bashrc")
+        self.assertTrue(
+            os.path.exists(os.path.join(model_basepath, symlink_existing_file_rel))
+        )
+
+        # Contents to try writing to the files; the writes are expected to fail
+        new_contents = "This shouldn't exist"
+        new_contents_b64 = base64.b64encode(new_contents.encode())
+
+        new_files = [new_file_rel, symlink_new_file_rel]
+        existing_files = [existing_file_rel, symlink_existing_file_rel]
+        all_files = new_files + existing_files
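+        # Every load below is expected to be rejected: the server should refuse
+        # any file override whose path resolves outside the model directory.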
+        for filepath in all_files:
+            # minimal config to create a new model
+            config = json.dumps({"backend": "identity"})
+            files = {f"file:{filepath}": new_contents_b64}
+            with httpclient.InferenceServerClient("localhost:8000") as client:
+                with self.assertRaisesRegex(InferenceServerException, "failed to load"):
+                    client.load_model("new_model", config=config, files=files)
+
+        for rel_path in new_files:
+            # Assert new file wasn't created
+            self.assertFalse(os.path.exists(os.path.join(model_basepath, rel_path)))
+
+        for rel_path in existing_files:
+            # Read the existing file and make sure its contents weren't overwritten
+            existing_file = os.path.join(model_basepath, rel_path)
+            self.assertTrue(os.path.exists(existing_file))
+            with open(existing_file) as f:
+                contents = f.read()
+                self.assertNotEqual(contents, new_contents)
 
     def test_shutdown_dynamic(self):
         model_shape = (1, 1)
         input_data = np.ones(shape=(1, 1), dtype=np.float32)
 
-        inputs = [grpcclient.InferInput('INPUT0', model_shape, "FP32")]
+        inputs = [grpcclient.InferInput("INPUT0", model_shape, "FP32")]
         inputs[0].set_data_from_numpy(input_data)
 
-        triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient("localhost:8001", verbose=True)
         model_name = "custom_zero_1_float32"
 
         # Send two requests as only requests held in scheduler are counted
@@ -2403,26 +2657,27 @@ def callback(user_data, result, error):
         request_count = 6
         async_results = []
         for _ in range(request_count):
-            triton_client.async_infer(model_name, inputs,
-                                      partial(callback, async_results))
+            triton_client.async_infer(
+                model_name, inputs, partial(callback, async_results)
+            )
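+        # Requests already queued in the scheduler should still complete after the
+        # shutdown signal below; only newly submitted requests are rejected.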
         time.sleep(1)
 
         # Send signal to shutdown the server
-        os.kill(int(os.environ['SERVER_PID']), signal.SIGINT)
+        os.kill(int(os.environ["SERVER_PID"]), signal.SIGINT)
 
         # Send more requests, which should be rejected
         try:
             triton_client.infer(model_name, inputs)
-            self.assertTrue(False,
-                            "expected error for new inference during shutdown")
+            self.assertTrue(False, "expected error for new inference during shutdown")
         except InferenceServerException as ex:
             self.assertIn(
                 "Server is stopping, scheduler for model has stopped accepting new inference requests",
-                ex.message())
+                ex.message(),
+            )
 
         # Wait until the results are available in user_data
         time_out = 30
-        while ((len(async_results) < request_count) and time_out > 0):
+        while (len(async_results) < request_count) and time_out > 0:
             time_out = time_out - 1
             time.sleep(1)
 
@@ -2430,21 +2685,19 @@ def callback(user_data, result, error):
         for result in async_results:
             if type(result) == InferenceServerException:
                 raise result
-            output_data = result.as_numpy('OUTPUT0')
+            output_data = result.as_numpy("OUTPUT0")
             np.testing.assert_allclose(
-                output_data,
-                input_data,
-                err_msg='Inference result is not correct')
+                output_data, input_data, err_msg="Inference result is not correct"
+            )
 
     def test_shutdown_sequence(self):
         model_shape = (1, 1)
         input_data = np.ones(shape=(1, 1), dtype=np.int32)
 
-        inputs = [grpcclient.InferInput('INPUT', model_shape, "INT32")]
+        inputs = [grpcclient.InferInput("INPUT", model_shape, "INT32")]
         inputs[0].set_data_from_numpy(input_data)
 
-        triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient("localhost:8001", verbose=True)
         model_name = "custom_sequence_int32"
 
         # Send two requests as only requests held in scheduler are counted
@@ -2459,59 +2712,57 @@ def callback(user_data, result, error):
         request_count = 2
         async_results = []
         for i in range(request_count):
-            triton_client.async_infer(model_name,
-                                      inputs,
-                                      partial(callback, async_results),
-                                      sequence_id=(i + 1),
-                                      sequence_start=True)
+            triton_client.async_infer(
+                model_name,
+                inputs,
+                partial(callback, async_results),
+                sequence_id=(i + 1),
+                sequence_start=True,
+            )
         time.sleep(1)
 
         # Send signal to shutdown the server
-        os.kill(int(os.environ['SERVER_PID']), signal.SIGINT)
+        os.kill(int(os.environ["SERVER_PID"]), signal.SIGINT)
 
         # Send requests with different characteristics
-        # 1: New sequence with new seqeuence ID
-        try:
-            triton_client.infer(model_name,
-                                inputs,
-                                sequence_id=request_count,
-                                sequence_start=True)
-            self.assertTrue(False,
-                            "expected error for new inference during shutdown")
+        # 1: New sequence with new sequence ID
+        try:
+            triton_client.infer(
+                model_name, inputs, sequence_id=request_count, sequence_start=True
+            )
+            self.assertTrue(False, "expected error for new inference during shutdown")
         except InferenceServerException as ex:
             self.assertIn(
                 "Server is stopping, scheduler for model has stopped accepting new inference requests",
-                ex.message())
-        # 2: New sequence with existing seqeuence ID
-        try:
-            triton_client.infer(model_name,
-                                inputs,
-                                sequence_id=1,
-                                sequence_start=True)
-            self.assertTrue(False,
-                            "expected error for new inference during shutdown")
+                ex.message(),
+            )
+        # 2: New sequence with existing sequence ID
+        try:
+            triton_client.infer(model_name, inputs, sequence_id=1, sequence_start=True)
+            self.assertTrue(False, "expected error for new inference during shutdown")
         except InferenceServerException as ex:
             self.assertIn(
                 "Server is stopping, scheduler for model has stopped accepting new inference requests",
-                ex.message())
+                ex.message(),
+            )
         # 3: Continuing sequence
         try:
-            res = triton_client.infer(model_name,
-                                      inputs,
-                                      sequence_id=2,
-                                      sequence_end=True)
-            output_data = res.as_numpy('OUTPUT')
+            res = triton_client.infer(
+                model_name, inputs, sequence_id=2, sequence_end=True
+            )
+            output_data = res.as_numpy("OUTPUT")
             # Results are accumulated
             np.testing.assert_allclose(
                 output_data,
                 input_data + input_data,
-                err_msg='Inference result is not correct')
+                err_msg="Inference result is not correct",
+            )
         except Exception as ex:
             self.assertTrue(False, "unexpected error {}".format(ex))
 
         # Wait until the results are available in user_data
         time_out = 30
-        while ((len(async_results) < request_count) and time_out > 0):
+        while (len(async_results) < request_count) and time_out > 0:
             time_out = time_out - 1
             time.sleep(1)
 
@@ -2519,11 +2770,10 @@ def callback(user_data, result, error):
         for result in async_results:
             if type(result) == InferenceServerException:
                 raise result
-            output_data = result.as_numpy('OUTPUT')
+            output_data = result.as_numpy("OUTPUT")
             np.testing.assert_allclose(
-                output_data,
-                input_data,
-                err_msg='Inference result is not correct')
+                output_data, input_data, err_msg="Inference result is not correct"
+            )
 
         # Sleep 5 seconds for the scheduler timeout to take effect, which should
         # reduce the in-flight count
@@ -2533,11 +2783,10 @@ def test_shutdown_ensemble(self):
         model_shape = (1, 1)
         input_data = np.ones(shape=(1, 1), dtype=np.float32)
 
-        inputs = [grpcclient.InferInput('INPUT0', model_shape, "FP32")]
+        inputs = [grpcclient.InferInput("INPUT0", model_shape, "FP32")]
         inputs[0].set_data_from_numpy(input_data)
 
-        triton_client = grpcclient.InferenceServerClient("localhost:8001",
-                                                         verbose=True)
+        triton_client = grpcclient.InferenceServerClient("localhost:8001", verbose=True)
         model_name = "ensemble_zero_1_float32"
 
         # Send two requests as only requests held in scheduler are counted
@@ -2554,26 +2803,28 @@ def callback(user_data, result, error):
         request_count = 1
         async_results = []
         for _ in range(request_count):
-            triton_client.async_infer(model_name, inputs,
-                                      partial(callback, async_results))
+            triton_client.async_infer(
+                model_name, inputs, partial(callback, async_results)
+            )
         time.sleep(1)
 
         # Send signal to shutdown the server
-        os.kill(int(os.environ['SERVER_PID']), signal.SIGINT)
+        os.kill(int(os.environ["SERVER_PID"]), signal.SIGINT)
 
         # Send more requests and should be rejected
         try:
             triton_client.infer(model_name, inputs)
-            self.assertTrue(False,
-                            "expected error for new inference during shutdown")
+            self.assertTrue(False, "expected error for new inference during shutdown")
         except InferenceServerException as ex:
+            self.assertIn("in ensemble 'ensemble_zero_1_float32'", ex.message())
             self.assertIn(
-                "in ensemble 'ensemble_zero_1_float32', Server is stopping, scheduler for model has stopped accepting new inference requests",
-                ex.message())
+                "Server is stopping, scheduler for model has stopped accepting new inference requests",
+                ex.message(),
+            )
 
         # Wait until the results are available in user_data
         time_out = 10
-        while ((len(async_results) < request_count) and time_out > 0):
+        while (len(async_results) < request_count) and time_out > 0:
             time_out = time_out - 1
             time.sleep(1)
 
@@ -2581,12 +2832,428 @@ def callback(user_data, result, error):
         for result in async_results:
             if type(result) == InferenceServerException:
                 raise result
-            output_data = result.as_numpy('OUTPUT0')
+            output_data = result.as_numpy("OUTPUT0")
             np.testing.assert_allclose(
-                output_data,
-                input_data,
-                err_msg='Inference result is not correct')
+                output_data, input_data, err_msg="Inference result is not correct"
+            )
+
+    def test_load_gpu_limit(self):
+        model_name = "cuda_memory_consumer"
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+            triton_client.load_model(model_name + "_1")
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+        # After the first load, the memory consumption should have exceeded
+        # the specified limit, so the second load will fail
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+            triton_client.load_model(model_name + "_2")
+            self.assertTrue(False, "expected error for loading model")
+        except Exception as ex:
+            self.assertIn("memory limit set for GPU 0 has exceeded", ex.message())
+
+        # Load should work after explicitly unloading a model to free memory
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+            triton_client.unload_model(model_name + "_1")
+            triton_client.load_model(model_name + "_2")
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+    def test_concurrent_model_load_speedup(self):
+        # Initialize client
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        # Each model should have a loading delay of 10 seconds
+        model_pairs = [
+            ["identity_zero_1_int32_1", "identity_zero_1_int32_2"],
+            ["python_identity_fp32_1", "python_identity_fp32_2"],
+        ]
+        # Test each model pair for speed up
+        for model_pair in model_pairs:
+            # Load both models concurrently
+            threads = []
+            for model_name in model_pair:
+                threads.append(
+                    threading.Thread(
+                        target=triton_client.load_model, args=(model_name,)
+                    )
+                )
+            start_time = time.time()
+            for thread in threads:
+                thread.start()
+            for thread in threads:
+                thread.join()
+            end_time = time.time()
+            loading_time = end_time - start_time
+            # Each of the two models has a minimum loading delay of 10 seconds.
+            # Speedup is observed when the concurrent loading time is less than
+            # 20 seconds, but use a tighter bound of 15 seconds.
+            self.assertLess(
+                loading_time, 15.0, "Concurrent loading speedup not observed"
+            )
+            # Concurrent loading time cannot be < 10 seconds
+            self.assertGreaterEqual(
+                loading_time, 10.0, "Invalid concurrent loading time"
+            )
+            # Make sure the models are loaded
+            self.assertTrue(triton_client.is_server_live())
+            self.assertTrue(triton_client.is_server_ready())
+            for model_name in model_pair:
+                self.assertTrue(triton_client.is_model_ready(model_name))
+
+    def test_concurrent_model_load(self):
+        # Initialize client
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        # Load the same-named model concurrently
+        with concurrent.futures.ThreadPoolExecutor() as pool:
+            # First, load an identity backend model that has a 10-second loading delay
+            thread_1 = pool.submit(triton_client.load_model, "identity_model")
+            time.sleep(2)  # wait between loads
+            # Switch the model files to the Python backend version
+            shutil.move("models", "models_v1")
+            shutil.move("models_v2", "models")
+            # Second load should be blocked until the first completes
+            thread_2 = pool.submit(triton_client.load_model, "identity_model")
+            # Both loads should succeed
+            thread_1.result()
+            thread_2.result()
+        # Check the model is ready
+        self.assertTrue(triton_client.is_server_live())
+        self.assertTrue(triton_client.is_server_ready())
+        self.assertTrue(triton_client.is_model_ready("identity_model"))
+        # Check that the model ultimately loaded is the second one
+        model_metadata = triton_client.get_model_metadata("identity_model")
+        self.assertEqual(model_metadata.platform, "python")
+
+    def test_concurrent_model_load_unload(self):
+        # Initialize client
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        # Load identity_zero_1_int32 and unload it while loading
+        # The unload operation should wait until the load is completed
+        with concurrent.futures.ThreadPoolExecutor() as pool:
+            load_thread = pool.submit(triton_client.load_model, "identity_zero_1_int32")
+            time.sleep(2)  # wait between load and unload
+            unload_thread = pool.submit(
+                triton_client.unload_model, "identity_zero_1_int32"
+            )
+            load_thread.result()
+            unload_thread.result()
+        self.assertTrue(triton_client.is_server_live())
+        self.assertTrue(triton_client.is_server_ready())
+        self.assertFalse(triton_client.is_model_ready("identity_zero_1_int32"))
+        # Load ensemble_zero_1_float32 and unload its dependency while loading
+        # The unload operation should wait until the load is completed
+        with concurrent.futures.ThreadPoolExecutor() as pool:
+            load_thread = pool.submit(
+                triton_client.load_model, "ensemble_zero_1_float32"
+            )
+            time.sleep(2)  # wait between load and unload
+            unload_thread = pool.submit(
+                triton_client.unload_model, "custom_zero_1_float32"
+            )
+            load_thread.result()
+            unload_thread.result()
+        self.assertTrue(triton_client.is_server_live())
+        self.assertTrue(triton_client.is_server_ready())
+        self.assertFalse(triton_client.is_model_ready("ensemble_zero_1_float32"))
+        self.assertFalse(triton_client.is_model_ready("custom_zero_1_float32"))
+        # Load both models and unload them concurrently
+        model_names = ["identity_zero_1_int32", "ensemble_zero_1_float32"]
+        for is_load in [True, False]:
+            action_fn = (
+                triton_client.load_model if is_load else triton_client.unload_model
+            )
+            with concurrent.futures.ThreadPoolExecutor() as pool:
+                threads = []
+                for model_name in model_names:
+                    threads.append(pool.submit(action_fn, model_name))
+                for thread in concurrent.futures.as_completed(threads):
+                    thread.result()
+            for model_name in model_names:
+                self.assertEqual(is_load, triton_client.is_model_ready(model_name))
+
+    def test_concurrent_same_model_load_unload_stress(self):
+        model_name = "identity_zero_1_int32"
+        num_threads = 32
+        num_iterations = 1024
+        try:
+            triton_client = grpcclient.InferenceServerClient(
+                "localhost:8001", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+        load_fail_reasons = [
+            "unexpected miss in global map",
+            "no version is available",
+            "failed to poll from model repository",
+        ]
+        unload_fail_reasons = ["versions that are still available: 1"]
+        load_fail_messages = [
+            ("failed to load '" + model_name + "', " + reason)
+            for reason in load_fail_reasons
+        ]
+        unload_fail_messages = [
+            ("failed to unload '" + model_name + "', " + reason)
+            for reason in unload_fail_reasons
+        ]
+        global_exception_stats = {}  # { "exception message": number of occurrences }
+        load_before_unload_finish = [False]  # use list to access by reference
+
+        def _load_unload():
+            exception_stats = {}  # { "exception message": number of occurrences }
+            for i in range(num_iterations):
+                try:
+                    triton_client.load_model(model_name)
+                except InferenceServerException as ex:
+                    # It is acceptable for an unload to happen after a load
+                    # completes, but only before the load can verify its load state.
+                    error_message = ex.message()
+                    self.assertIn(error_message, load_fail_messages)
+                    if error_message not in exception_stats:
+                        exception_stats[error_message] = 0
+                    exception_stats[error_message] += 1
+                try:
+                    triton_client.unload_model(model_name)
+                except InferenceServerException as ex:
+                    # It is acceptable for a load to happen after an unload
+                    # completes, but only before the unload can verify its unload state.
+                    error_message = ex.message()
+                    self.assertIn(error_message, unload_fail_messages)
+                    if error_message not in exception_stats:
+                        exception_stats[error_message] = 0
+                    exception_stats[error_message] += 1
+                    load_before_unload_finish[0] = True
+            return exception_stats
+
+        with concurrent.futures.ThreadPoolExecutor() as pool:
+            threads = []
+            for i in range(num_threads):
+                threads.append(pool.submit(_load_unload))
+            for t in threads:
+                exception_stats = t.result()
+                for key, count in exception_stats.items():
+                    if key not in global_exception_stats:
+                        global_exception_stats[key] = 0
+                    global_exception_stats[key] += count
+
+        self.assertTrue(triton_client.is_server_live())
+        self.assertTrue(triton_client.is_server_ready())
+        self.assertTrue(
+            load_before_unload_finish[0],
+            "The test case did not replicate a load while async unloading. Consider increase concurrency.",
+        )
+
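+        # Note: test.sh prints this statistics file when the test passes so the
+        # exception counts gathered across threads can be inspected afterwards.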
+        stats_path = "./test_concurrent_same_model_load_unload_stress.statistics.log"
+        with open(stats_path, mode="w", encoding="utf-8") as f:
+            f.write(str(global_exception_stats) + "\n")
+
+    def test_concurrent_model_instance_load_speedup(self):
+        # Initialize client
+        try:
+            triton_client = httpclient.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        models = ["identity_fp32"]
+        # Create 2 instances, each of which has a loading delay of 10 seconds.
+        num_instances = 2
+        instance_group = [{"kind": "KIND_CPU", "count": num_instances}]
+        config = {"instance_group": instance_group}
+        for model in models:
+            # Instances should be loaded concurrently for supported backends
+            start_time = time.time()
+            try:
+                triton_client.load_model(model, config=json.dumps(config))
+            except Exception as ex:
+                self.assertTrue(False, "unexpected error {}".format(ex))
+            end_time = time.time()
+            loading_time = end_time - start_time
+            print(f"Time to load {num_instances} instances: {loading_time}")
+
+            # Each of the two instances has a minimum loading delay of 10 seconds.
+            # Speedup is observed when the concurrent loading time is less than
+            # 20 seconds, but use a tighter bound of 15 seconds.
+            self.assertLess(
+                loading_time, 15.0, "Concurrent loading speedup not observed"
+            )
+            # Concurrent loading time cannot be < 10 seconds
+            self.assertGreaterEqual(
+                loading_time, 10.0, "Invalid concurrent loading time"
+            )
+            # Make sure the models are loaded
+            self.assertTrue(triton_client.is_server_live())
+            self.assertTrue(triton_client.is_server_ready())
+            self.assertTrue(triton_client.is_model_ready(model))
+
+    def _call_with_timeout(self, callable, timeout_secs):
+        # Setup handler for timing out call
+        def timeout_handler(sig, frame):
+            raise TimeoutError()
+
+        signal.signal(signal.SIGALRM, timeout_handler)
+        signal.alarm(timeout_secs)
+        result = callable()
+        # Cancel any pending alarm so it cannot fire after the call returns
+        signal.alarm(0)
+        return result
+
+    def _call_with_expected_timeout(self, callable, timeout_secs=3):
+        # Call callable with expectation that it will timeout
+        try:
+            self._call_with_timeout(callable, timeout_secs)
+        except TimeoutError:
+            print("Inference timed out as expected.")
+            return
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+        else:
+            self.assertTrue(False, "unexpected success, call should've timed out.")
+
+    def _get_fp32_io(self, client_type):
+        # Config
+        input_names = ["INPUT0", "INPUT1"]
+        output_names = ["OUTPUT0", "OUTPUT1"]
+        dtype, dims, shape = ("TYPE_FP32", [-1, 16], [1, 16])
+        input_config = [
+            {"name": name, "data_type": dtype, "dims": dims} for name in input_names
+        ]
+        output_config = [
+            {"name": name, "data_type": dtype, "dims": dims} for name in output_names
+        ]
+        # Inputs
+        inputs = []
+        for name in input_names:
+            inputs.append(
+                client_type.InferInput(name, shape, dtype.replace("TYPE_", ""))
+            )
+            inputs[-1].set_data_from_numpy(np.ones(shape, dtype=np.float32))
+        return input_config, output_config, inputs
+
+    def test_concurrent_model_instance_load_sanity(self):
+        cpu, gpu = "KIND_CPU", "KIND_GPU"
+        default_kinds = [cpu, gpu]
+        backend_kinds = {"plan": [gpu], "openvino": [cpu]}
+        try:
+            client_type = httpclient
+            triton_client = client_type.InferenceServerClient(
+                "localhost:8000", verbose=True
+            )
+        except Exception as ex:
+            self.assertTrue(False, "unexpected error {}".format(ex))
+
+        backends = os.environ.get("PARALLEL_BACKENDS", "").split()
+        self.assertTrue(len(backends) > 0, "PARALLEL_BACKENDS wasn't set")
+
+        num_instances = 5
+        input_config, output_config, inputs = self._get_fp32_io(client_type)
+        for backend in backends:
+            model = tu.get_model_name(backend, np.float32, np.float32, np.float32)
+            kinds = backend_kinds.get(backend, default_kinds)
+            for kind in kinds:
+                with self.subTest(backend=backend, model=model, kind=kind):
+                    # Setup model config
+                    instance_group = {"kind": kind, "count": num_instances}
+                    # Disable batching and configure sequence batching so that each
+                    # instance cannot accept new requests while it is busy with an
+                    # ongoing sequence. This guarantees exactly one request is sent
+                    # to each instance.
+                    max_batch_size = 0
+                    sequence_timeout_secs = 10
+                    sequence_batching = {
+                        "direct": {},
+                        "max_sequence_idle_microseconds": sequence_timeout_secs
+                        * 1000000,
+                    }
+                    config = {
+                        "instance_group": instance_group,
+                        "max_batch_size": max_batch_size,
+                        "sequence_batching": sequence_batching,
+                        "input": input_config,
+                        "output": output_config,
+                    }
+                    print(
+                        f"~~~ Backend: [{backend}], Model: [{model}], Config: [{config}] ~~~"
+                    )
+                    # Load the model
+                    try:
+                        triton_client.load_model(model, config=json.dumps(config))
+                    except Exception as ex:
+                        self.assertTrue(False, "unexpected error {}".format(ex))
+
+                    # Make sure the model is loaded
+                    self.assertTrue(triton_client.is_server_live())
+                    self.assertTrue(triton_client.is_model_ready(model))
+                    print(
+                        "Model Repository Index after load:",
+                        triton_client.get_model_repository_index(),
+                    )
+
+                    # Test inference on each instance
+                    for i in range(1, num_instances + 1):
+                        try:
+                            triton_client.infer(
+                                model, inputs, sequence_id=i, sequence_start=True
+                            )
+                        except Exception as ex:
+                            self.assertTrue(
+                                False, "unexpected inference error {}".format(ex)
+                            )
+
+                    # Each instance should be busy until its sequence times out, so an
+                    # additional infer call should time out. If it doesn't, something is
+                    # wrong and the test should fail.
+                    callable = partial(
+                        triton_client.infer,
+                        model,
+                        inputs,
+                        sequence_id=num_instances + 1,
+                        sequence_start=True,
+                    )
+                    self._call_with_expected_timeout(callable, timeout_secs=3)
+
+                    # Unload the model
+                    try:
+                        triton_client.unload_model(model)
+                    except Exception as ex:
+                        self.assertTrue(False, "unexpected error {}".format(ex))
+
+                    # Allow server to fully unload model before next test iteration
+                    num_tries = 10
+                    for i in range(num_tries):
+                        if triton_client.is_server_ready():
+                            break
+                        print(
+                            f"[Attempt {i}] Server not ready yet, sleeping and retrying. Current repository index: {triton_client.get_model_repository_index()}"
+                        )
+                        time.sleep(6)
+                    print(
+                        "Model Repository Index after unload attempts:",
+                        triton_client.get_model_repository_index(),
+                    )
+                    self.assertTrue(triton_client.is_server_ready())
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     unittest.main()
diff --git a/qa/L0_lifecycle/test.sh b/qa/L0_lifecycle/test.sh
index 5a34798aa1..8c389d46ac 100755
--- a/qa/L0_lifecycle/test.sh
+++ b/qa/L0_lifecycle/test.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -48,6 +48,21 @@ SERVER=/opt/tritonserver/bin/tritonserver
 TEST_RESULT_FILE='test_results.txt'
 source ../common/util.sh
 
+function check_unit_test() {
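+    # Note: this helper inspects "$?", so it must be called on the line
+    # immediately following the python unit test invocation, e.g.:
+    #     python $LC_TEST LifeCycleTest.<test_name> >>$CLIENT_LOG 2>&1
+    #     check_unit_test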
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Failed\n***"
+        RET=1
+    else
+        check_test_results $TEST_RESULT_FILE 1
+        if [ $? -ne 0 ]; then
+            cat $CLIENT_LOG
+            echo -e "\n***\n*** Test Result Verification Failed\n***"
+            RET=1
+        fi
+    fi
+}
+
 RET=0
 rm -fr *.log
 
@@ -74,18 +89,7 @@ sleep $SLEEP_TIME
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_noexit >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -109,18 +113,7 @@ sleep $SLEEP_TIME
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_noexit >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -146,18 +139,7 @@ sleep $SLEEP_TIME
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_noexit >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -183,18 +165,7 @@ sleep $SLEEP_TIME
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_noexit >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -206,7 +177,7 @@ LOG_IDX=$((LOG_IDX+1))
 rm -rf models
 mkdir models
 SERVER_ARGS="--model-repository=`pwd`/models"
-SERVER_LOG="./inference_server_$LOG_IDX.log"
+SERVER_LOG="./stub_inference_server_$LOG_IDX.log"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
@@ -215,6 +186,7 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 SAVED_SERVER_PID=$SERVER_PID
 SERVER_ARGS="--model-repository=`pwd`/models --http-port 8003 --metrics-port 8004"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
 run_server
 sleep $SLEEP_TIME
 # check server log for the warning messages
@@ -236,7 +208,7 @@ LOG_IDX=$((LOG_IDX+1))
 rm -rf models
 mkdir models
 SERVER_ARGS="--model-repository=`pwd`/models"
-SERVER_LOG="./inference_server_$LOG_IDX.log"
+SERVER_LOG="./stub_inference_server_$LOG_IDX.log"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
@@ -245,6 +217,7 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 SAVED_SERVER_PID=$SERVER_PID
 SERVER_ARGS="--model-repository=`pwd`/models --grpc-port 8003 --metrics-port 8004"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
 run_server
 sleep $SLEEP_TIME
 # check server log for the warning messages
@@ -267,7 +240,7 @@ LOG_IDX=$((LOG_IDX+1))
 rm -rf models
 mkdir models
 SERVER_ARGS="--model-repository=`pwd`/models"
-SERVER_LOG="./inference_server_$LOG_IDX.log"
+SERVER_LOG="./stub_inference_server_$LOG_IDX.log"
 run_server
 if [ "$SERVER_PID" == "0" ]; then
     echo -e "\n***\n*** Failed to start $SERVER\n***"
@@ -276,6 +249,7 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 SAVED_SERVER_PID=$SERVER_PID
 SERVER_ARGS="--model-repository=`pwd`/models --grpc-port 8003 --http-port 8004"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
 run_server
 sleep $SLEEP_TIME
 # check server log for the warning messages
@@ -365,7 +339,9 @@ done
 for i in onnx plan ; do
     cp -r $DATADIR/qa_model_repository/${i}_float32_float32_float32 models_0/.
 done
-rm models/graphdef_float32_float32_float32/*/*
+# Change the model files so that multiple versions will be loaded, and one of
+# the versions will fail to load and cause all other versions to be unloaded.
+rm models/graphdef_float32_float32_float32/3/*
 
 SERVER_ARGS="--model-repository=`pwd`/models --model-repository=`pwd`/models_0 \
              --exit-on-error=false --exit-timeout-secs=5"
@@ -383,18 +359,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_modelfail >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -419,18 +384,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_modelfail_nostrict >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -449,8 +403,10 @@ for i in onnx plan ; do
 done
 rm models/graphdef_float32_float32_float32/config.pbtxt
 
+# Autocomplete should not be turned on for this test because the test asserts
+# that an error is logged when running in strict model configuration mode.
 SERVER_ARGS="--model-repository=`pwd`/models --model-repository=`pwd`/models_0 \
-             --exit-on-error=false --exit-timeout-secs=5"
+             --exit-on-error=false --exit-timeout-secs=5 --strict-model-config=true"
 SERVER_LOG="./inference_server_$LOG_IDX.log"
 run_server_tolive
 if [ "$SERVER_PID" == "0" ]; then
@@ -465,18 +421,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_no_model_config >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -521,18 +466,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_init_error_modelfail >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -566,18 +500,7 @@ wait_for_model_stable $SERVER_TIMEOUT
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_error_model_no_version >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -606,18 +529,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_ignore_zero_prefixed_version >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -652,18 +564,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_parse_ignore_non_intergral_version >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -698,18 +599,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_model_load_unload >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -738,18 +628,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_model_load_unload_disabled >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -777,18 +656,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_version_load_unload >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -817,18 +685,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_version_load_unload_disabled >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -863,18 +720,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_model_modify >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -902,18 +748,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_dynamic_file_delete >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -947,18 +782,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_multiple_model_repository_polling >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -994,18 +818,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_multiple_model_repository_control >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1037,18 +850,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_control >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1080,18 +882,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_control_fail >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1123,18 +914,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_control_ensemble >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1177,18 +957,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_multiple_model_repository_control_startup_models >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1231,18 +1000,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_multiple_model_repository_control_startup_models >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1252,8 +1010,8 @@ LOG_IDX=$((LOG_IDX+1))
 
 # Test loading all models on startup in EXPLICIT model control mode AND
 # an additional --load-model argument, it should fail
-rm -fr models 
-mkdir models 
+rm -fr models
+mkdir models
 for i in onnx ; do
     cp -r $DATADIR/qa_model_repository/${i}_float32_float32_float32 models/.
     sed -i "s/max_batch_size:.*/max_batch_size: 1/" models/${i}_float32_float32_float32/config.pbtxt
@@ -1280,6 +1038,34 @@ fi
 
 LOG_IDX=$((LOG_IDX+1))
 
+# Test loading a startup model that doesn't exist, it should fail
+rm -fr models && mkdir models
+INVALID_MODEL="does-not-exist"
+SERVER_ARGS="--model-repository=`pwd`/models \
+             --model-control-mode=explicit \
+             --strict-readiness=true \
+             --exit-on-error=true \
+             --load-model=${INVALID_MODEL}"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** Failed: $SERVER started successfully when it was expected to fail\n***"
+    echo -e "ERROR: Startup model [${INVALID_MODEL}] should have failed to load."
+    cat $SERVER_LOG
+    RET=1
+
+    kill $SERVER_PID
+    wait $SERVER_PID
+fi
+# check server log for the error messages to make sure they're printed
+if [ `grep -c "model not found in any model repository" $SERVER_LOG` == "0" ]; then
+    echo -e "\n***\n*** Server log ${SERVER_LOG} did not print model load failure for non-existent model\n***"
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
+LOG_IDX=$((LOG_IDX+1))
+
 # LifeCycleTest.test_model_repository_index
 rm -fr models models_0 config.pbtxt.*
 mkdir models models_0
@@ -1313,18 +1099,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_repository_index >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1508,18 +1283,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_model_reload_fail >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1539,48 +1303,67 @@ for protocol in grpc http; do
     if [[ $protocol == "grpc" ]]; then
        export TRITONSERVER_USE_GRPC=1
     fi
-    rm -fr models simple_float32_float32_float32
-    mkdir models
-    # Prepare two models of different platforms, but with the same name
-    cp -r $DATADIR/qa_model_repository/plan_float32_float32_float32 models/simple_float32_float32_float32
-    sed -i "s/plan_float32_float32_float32/simple_float32_float32_float32/" models/simple_float32_float32_float32/config.pbtxt
-    cp -r $DATADIR/qa_model_repository/libtorch_float32_float32_float32 simple_float32_float32_float32
-    sed -i "s/libtorch_float32_float32_float32/simple_float32_float32_float32/" simple_float32_float32_float32/config.pbtxt
 
-    SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit \
-                 --load-model=simple_float32_float32_float32 \
-                 --exit-timeout-secs=5"
-    SERVER_LOG="./inference_server_$LOG_IDX.log"
-    run_server
-    if [ "$SERVER_PID" == "0" ]; then
-        echo -e "\n***\n*** Failed to start $SERVER\n***"
-        cat $SERVER_LOG
-        exit 1
-    fi
+    # The OS file system is more granular when determining modification time:
+    # the modification timestamp is updated when the file content is changed in
+    # place, but not when the file is copied or moved. With Triton, any
+    # operation that changes a file is a modification. Thus, preparing the
+    # models in reverse order tests the case where a replacement model has an
+    # earlier or equal modification timestamp than the current model; Triton
+    # must still detect that the model is modified and proceed with the model
+    # reload.
+    for prep_order in normal reverse; do
+        rm -fr models simple_float32_float32_float32
+        mkdir models
+        # Prepare two models of different platforms, but with the same name
+        if [[ $prep_order == "normal" ]]; then
+            # Prepare the TRT model first, then the pytorch model
+            cp -r $DATADIR/qa_model_repository/plan_float32_float32_float32 models/simple_float32_float32_float32
+            sed -i "s/plan_float32_float32_float32/simple_float32_float32_float32/" models/simple_float32_float32_float32/config.pbtxt
+            cp -r $DATADIR/qa_model_repository/libtorch_float32_float32_float32 simple_float32_float32_float32
+            sed -i "s/libtorch_float32_float32_float32/simple_float32_float32_float32/" simple_float32_float32_float32/config.pbtxt
+        else
+            # Prepare the pytorch model first, then the TRT model
+            cp -r $DATADIR/qa_model_repository/libtorch_float32_float32_float32 simple_float32_float32_float32
+            sed -i "s/libtorch_float32_float32_float32/simple_float32_float32_float32/" simple_float32_float32_float32/config.pbtxt
+            cp -r $DATADIR/qa_model_repository/plan_float32_float32_float32 models/simple_float32_float32_float32
+            sed -i "s/plan_float32_float32_float32/simple_float32_float32_float32/" models/simple_float32_float32_float32/config.pbtxt
+        fi
 
-    rm -f $CLIENT_LOG
-    set +e
-    python $LC_TEST LifeCycleTest.test_load_same_model_different_platform >>$CLIENT_LOG 2>&1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Failed\n***"
-        RET=1
-    else
-        check_test_results $TEST_RESULT_FILE 1
+        SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit \
+                    --load-model=simple_float32_float32_float32 \
+                    --exit-timeout-secs=5"
+        SERVER_LOG="./inference_server_$LOG_IDX.log"
+        run_server
+        if [ "$SERVER_PID" == "0" ]; then
+            echo -e "\n***\n*** Failed to start $SERVER\n***"
+            cat $SERVER_LOG
+            exit 1
+        fi
+
+        rm -f $CLIENT_LOG
+        set +e
+        python $LC_TEST LifeCycleTest.test_load_same_model_different_platform >>$CLIENT_LOG 2>&1
         if [ $? -ne 0 ]; then
             cat $CLIENT_LOG
-            echo -e "\n***\n*** Test Result Verification Failed\n***"
+            echo -e "\n***\n*** Test Failed\n***"
             RET=1
+        else
+            check_test_results $TEST_RESULT_FILE 1
+            if [ $? -ne 0 ]; then
+                cat $CLIENT_LOG
+                echo -e "\n***\n*** Test Result Verification Failed\n***"
+                RET=1
+            fi
         fi
-    fi
-    set -e
+        set -e
 
-    kill $SERVER_PID
-    wait $SERVER_PID
+        kill $SERVER_PID
+        wait $SERVER_PID
 
-    unset TRITONSERVER_USE_GRPC
+        LOG_IDX=$((LOG_IDX+1))
+    done
 
-    LOG_IDX=$((LOG_IDX+1))
+    unset TRITONSERVER_USE_GRPC
 done
 
 # Send HTTP request to control endpoint
@@ -1668,7 +1451,7 @@ fi
 set +e
 code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/notapi/v2`
 set -e
-if [ "$code" != "400" ]; then
+if [ "$code" != "404" ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
 fi
@@ -1676,7 +1459,7 @@ fi
 set +e
 code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/notapi`
 set -e
-if [ "$code" != "400" ]; then
+if [ "$code" != "404" ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
 fi
@@ -1684,7 +1467,7 @@ fi
 set +e
 code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/models/notapi/foo`
 set -e
-if [ "$code" != "400" ]; then
+if [ "$code" != "404" ]; then
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
 fi
@@ -1716,18 +1499,7 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_config_override >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1760,18 +1532,9 @@ fi
 rm -f $CLIENT_LOG
 set +e
 python $LC_TEST LifeCycleTest.test_file_override >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
+python $LC_TEST LifeCycleTest.test_file_override_security >>$CLIENT_LOG 2>&1
+check_unit_test
 set -e
 
 kill $SERVER_PID
@@ -1787,7 +1550,7 @@ mkdir models
 cp -r ../custom_models/custom_zero_1_float32 models/. && \
     mkdir -p models/custom_zero_1_float32/1 && \
     (cd models/custom_zero_1_float32 && \
-        echo "dynamic_batching {}" >> config.pbtxt 
+        echo "dynamic_batching {}" >> config.pbtxt
         echo "parameters [" >> config.pbtxt && \
         echo "{ key: \"execute_delay_ms\"; value: { string_value: \"5000\" }}" >> config.pbtxt && \
         echo "]" >> config.pbtxt)
@@ -1802,19 +1565,9 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 
 set +e
+# Server will be shut down by the test script, so make its PID available to the script
 SERVER_PID=$SERVER_PID python $LC_TEST LifeCycleTest.test_shutdown_dynamic >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 # check server log
@@ -1846,19 +1599,9 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 
 set +e
+# Server will be shut down by the test script, so make its PID available to the script
 SERVER_PID=$SERVER_PID python $LC_TEST LifeCycleTest.test_shutdown_sequence >>$CLIENT_LOG 2>&1
-if [ $? -ne 0 ]; then
-    cat $CLIENT_LOG
-    echo -e "\n***\n*** Test Failed\n***"
-    RET=1
-else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
-fi
+check_unit_test
 set -e
 
 # check server log
@@ -1886,7 +1629,7 @@ cp -r ensemble_zero_1_float32 models/. && \
 cp -r ../custom_models/custom_zero_1_float32 models/. && \
     mkdir -p models/custom_zero_1_float32/1 && \
     (cd models/custom_zero_1_float32 && \
-        echo "dynamic_batching {}" >> config.pbtxt 
+        echo "dynamic_batching {}" >> config.pbtxt
         echo "parameters [" >> config.pbtxt && \
         echo "{ key: \"execute_delay_ms\"; value: { string_value: \"5000\" }}" >> config.pbtxt && \
         echo "]" >> config.pbtxt)
@@ -1901,32 +1644,306 @@ if [ "$SERVER_PID" == "0" ]; then
 fi
 
 set +e
+# Server will be shut down by the test script, so make its PID available to the script
 SERVER_PID=$SERVER_PID python $LC_TEST LifeCycleTest.test_shutdown_ensemble >>$CLIENT_LOG 2>&1
+check_unit_test
+set -e
+
+# check server log
+if [ `grep -c "Model 'ensemble_zero_1_float32' (version 1) has 1 in-flight inferences" $SERVER_LOG` == "0" ]; then
+    echo -e "\n***\n*** Expect logging for model and in-flight inference count\n***"
+    RET=1
+fi
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_load_gpu_limit
+# Install a dependency of the Python model to be used
+pip install cuda-python
+rm -fr models config.pbtxt.*
+mkdir models
+cp -r ../python_models/cuda_memory_consumer models/cuda_memory_consumer_1 && \
+    cp -r ../python_models/cuda_memory_consumer models/cuda_memory_consumer_2
+
+# Negative testing
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --model-load-gpu-limit -1:0.6"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** unexpected start $SERVER\n***"
+    cat $SERVER_LOG
+    RET=1
+    kill $SERVER_PID
+    wait $SERVER_PID
+elif [ `grep -c "expects device ID >= 0, got -1" $SERVER_LOG` == "0" ]; then
+    echo -e "\n***\n*** Expect error on invalid device\n***"
+    RET=1
+fi
+
+LOG_IDX=$((LOG_IDX+1))
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --model-load-gpu-limit 0:-0.4"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" != "0" ]; then
+    echo -e "\n***\n*** unexpected start $SERVER\n***"
+    cat $SERVER_LOG
+    RET=1
+    kill $SERVER_PID
+    wait $SERVER_PID
+elif [ `grep -c "expects limit fraction to be in range \[0.0, 1.0\], got -0.4" $SERVER_LOG` == "0" ]; then
+    echo -e "\n***\n*** Expect error on invalid fraction\n***"
+    RET=1
+fi
+
+LOG_IDX=$((LOG_IDX+1))
+
+# Run server to stop model loading if > 60% of GPU 0 memory is used
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --model-load-gpu-limit 0:0.6"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_load_gpu_limit >>$CLIENT_LOG 2>&1
+check_unit_test
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_load_speedup
+rm -rf models
+mkdir models
+MODEL_NAME="identity_zero_1_int32"
+cp -r ${MODEL_NAME} models && mkdir -p models/${MODEL_NAME}/1
+cp -r models/${MODEL_NAME} models/${MODEL_NAME}_1 && \
+    sed -i "s/${MODEL_NAME}/${MODEL_NAME}_1/" models/${MODEL_NAME}_1/config.pbtxt
+mv models/${MODEL_NAME} models/${MODEL_NAME}_2 && \
+    sed -i "s/${MODEL_NAME}/${MODEL_NAME}_2/" models/${MODEL_NAME}_2/config.pbtxt
+MODEL_NAME="identity_fp32"
+cp -r ../python_models/${MODEL_NAME} models && (cd models/${MODEL_NAME} && \
+    mkdir 1 && mv model.py 1 && \
+    echo "    def initialize(self, args):" >> 1/model.py && \
+    echo "        import time" >> 1/model.py && \
+    echo "        time.sleep(10)" >> 1/model.py)
+cp -r models/${MODEL_NAME} models/python_${MODEL_NAME}_1 && \
+    sed -i "s/${MODEL_NAME}/python_${MODEL_NAME}_1/" models/python_${MODEL_NAME}_1/config.pbtxt
+mv models/${MODEL_NAME} models/python_${MODEL_NAME}_2 && \
+    sed -i "s/${MODEL_NAME}/python_${MODEL_NAME}_2/" models/python_${MODEL_NAME}_2/config.pbtxt
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_model_load_speedup >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_load
+rm -rf models models_v1 models_v2
+mkdir models models_v2
+cp -r identity_zero_1_int32 models/identity_model && \
+    (cd models/identity_model && \
+        mkdir 1 && \
+        sed -i "s/identity_zero_1_int32/identity_model/" config.pbtxt)
+cp -r ../python_models/identity_fp32 models_v2/identity_model && \
+    (cd models_v2/identity_model && \
+        mkdir 1 && mv model.py 1 && \
+        sed -i "s/identity_fp32/identity_model/" config.pbtxt)
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_model_load >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_load_unload
+rm -rf models
+mkdir models
+cp -r identity_zero_1_int32 models && mkdir -p models/identity_zero_1_int32/1
+cp -r ensemble_zero_1_float32 models && mkdir -p models/ensemble_zero_1_float32/1
+cp -r ../custom_models/custom_zero_1_float32 models/. && \
+    mkdir -p models/custom_zero_1_float32/1 && \
+    (cd models/custom_zero_1_float32 && \
+        echo "parameters [" >> config.pbtxt && \
+        echo "{ key: \"creation_delay_sec\"; value: { string_value: \"10\" }}" >> config.pbtxt && \
+        echo "]" >> config.pbtxt)
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_model_load_unload >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_same_model_load_unload_stress
+rm -rf models
+mkdir models
+cp -r identity_zero_1_int32 models && \
+    (cd models/identity_zero_1_int32 && \
+        mkdir 1 && \
+        sed -i "s/string_value: \"10\"/string_value: \"0\"/" config.pbtxt)
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --model-load-thread-count=32 --log-verbose=2"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_same_model_load_unload_stress >>$CLIENT_LOG 2>&1
 if [ $? -ne 0 ]; then
     cat $CLIENT_LOG
     echo -e "\n***\n*** Test Failed\n***"
     RET=1
 else
-    check_test_results $TEST_RESULT_FILE 1
-    if [ $? -ne 0 ]; then
-        cat $CLIENT_LOG
-        echo -e "\n***\n*** Test Result Verification Failed\n***"
-        RET=1
-    fi
+    cat ./test_concurrent_same_model_load_unload_stress.statistics.log
 fi
 set -e
 
-# check server log
-if [ `grep -c "Model 'ensemble_zero_1_float32' (version 1) has 1 in-flight inferences" $SERVER_LOG` == "0" ]; then
-    echo -e "\n***\n*** Expect logging for model and in-flight inference count\n***"
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_instance_load_speedup
+rm -rf models
+mkdir models
+MODEL_NAME="identity_fp32"
+cp -r ../python_models/${MODEL_NAME} models/ && (cd models/${MODEL_NAME} && \
+    mkdir 1 && mv model.py 1 && \
+    echo "    def initialize(self, args):" >> 1/model.py && \
+    echo "        import time" >> 1/model.py && \
+    echo "        time.sleep(10)" >> 1/model.py)
+rm models/${MODEL_NAME}/config.pbtxt
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+python $LC_TEST LifeCycleTest.test_concurrent_model_instance_load_speedup >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
     RET=1
 fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+LOG_IDX=$((LOG_IDX+1))
+
+# LifeCycleTest.test_concurrent_model_instance_load_sanity
+rm -rf models
+mkdir models
+# Sanity check loading multiple instances in parallel for each supported backend
+PARALLEL_BACKENDS="python onnx"
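+# For python, reuse the identity_fp32 model from ../python_models; for onnx, copy
+# version 1 of the prebuilt QA model from $DATADIR.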
+for backend in ${PARALLEL_BACKENDS} ; do
+    model="${backend}_float32_float32_float32"
+    model_dir="models/${model}"
+    if [[ $backend == "python" ]]; then
+      cp -r ../python_models/identity_fp32 ${model_dir}
+      mkdir ${model_dir}/1 && mv ${model_dir}/model.py ${model_dir}/1
+      rm ${model_dir}/config.pbtxt
+    else
+      mkdir models/${model}
+      cp -r $DATADIR/qa_model_repository/${model}/1 models/${model}/1
+    fi
+done
+
+SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --log-verbose=2"
+SERVER_LOG="./inference_server_$LOG_IDX.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+PARALLEL_BACKENDS=${PARALLEL_BACKENDS} python $LC_TEST LifeCycleTest.test_concurrent_model_instance_load_sanity >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+set -e
 
 kill $SERVER_PID
 wait $SERVER_PID
 
 if [ $RET -eq 0 ]; then
   echo -e "\n***\n*** Test Passed\n***"
+else
+  echo -e "\n***\n*** Test Failed\n***"
 fi
 
 exit $RET
diff --git a/qa/L0_logging/logging_endpoint_test.py b/qa/L0_logging/logging_endpoint_test.py
new file mode 100755
index 0000000000..26f98de3da
--- /dev/null
+++ b/qa/L0_logging/logging_endpoint_test.py
@@ -0,0 +1,405 @@
+#!/usr/bin/python
+
+# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import sys
+
+sys.path.append("../common")
+
+import json
+import sys
+import unittest
+
+import test_util as tu
+import tritonclient.grpc as grpcclient
+import tritonclient.http as httpclient
+from google.protobuf import json_format
+
+
+# Similar setup to the dynamic batcher tests
+class LogEndpointTest(tu.TestResultCollector):
+    def tearDown(self):
+        # Clear all log settings back to the initial state.
+        # Note that tearDown() uses the HTTP client, so the pass/fail of the
+        # HTTP log setting test cases should be checked to make sure tearDown()
+        # executes properly and does not affect the starting state of other
+        # test cases.
+        clear_settings = {
+            "log_file": "",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        triton_client = httpclient.InferenceServerClient("localhost:8000")
+        triton_client.update_log_settings(settings=clear_settings)
+
+    def check_server_initial_state(self):
+        # Helper function to make sure the log settings are properly
+        # initialized / reset before actually running a test case.
+        # Note that this function uses the HTTP client, so the pass/fail of
+        # the HTTP log setting test cases should be checked to make sure the
+        # initial state is verified properly before running other test cases.
+        initial_settings = {
+            "log_file": "",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        triton_client = httpclient.InferenceServerClient("localhost:8000")
+        self.assertEqual(initial_settings, triton_client.get_log_settings())
+
+    def test_http_get_settings(self):
+        # Log settings will be the same as default settings since
+        # no update has been made.
+        initial_settings = {
+            "log_file": "",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        triton_client = httpclient.InferenceServerClient("localhost:8000")
+        self.assertEqual(
+            initial_settings,
+            triton_client.get_log_settings(),
+            "Unexpected initial log settings",
+        )
+
+    def test_grpc_get_settings(self):
+        # Log settings will be the same as default settings since
+        # no update has been made.
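+        # Build the expected LogSettingsResponse protobuf from its JSON form so it
+        # can be compared directly with the object returned by get_log_settings().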
+        initial_settings = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": ""},
+                        "log_info": {"boolParam": True},
+                        "log_warning": {"boolParam": True},
+                        "log_error": {"boolParam": True},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            initial_settings,
+        )
+        triton_client = grpcclient.InferenceServerClient("localhost:8001")
+        self.assertEqual(
+            initial_settings,
+            triton_client.get_log_settings(),
+            "Unexpected initial log settings",
+        )
+
+    def test_http_update_settings(self):
+        # Update each possible log configuration
+        # field and check that they are reflected
+        # by the server
+        self.check_server_initial_state()
+
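+        # Each settings dict below changes exactly one field relative to the
+        # previous one, so every log setting is exercised individually.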
+        expected_log_settings_1 = {
+            "log_file": "log_file.log",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_2 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_3 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_4 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_5 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 1,
+            "log_format": "default",
+        }
+        expected_log_settings_6 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 1,
+            "log_format": "ISO8601",
+        }
+
+        triton_client = httpclient.InferenceServerClient("localhost:8000")
+        self.assertEqual(
+            expected_log_settings_1,
+            triton_client.update_log_settings(settings=expected_log_settings_1),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_2,
+            triton_client.update_log_settings(settings=expected_log_settings_2),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_3,
+            triton_client.update_log_settings(settings=expected_log_settings_3),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_4,
+            triton_client.update_log_settings(settings=expected_log_settings_4),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_5,
+            triton_client.update_log_settings(settings=expected_log_settings_5),
+            "Unexpected updated log settings",
+        )
+        self.assertEqual(
+            expected_log_settings_6,
+            triton_client.update_log_settings(settings=expected_log_settings_6),
+            "Unexpected updated log settings",
+        )
+
+    def test_grpc_update_settings(self):
+        # Update each possible log configuration
+        # field and check that they are reflected
+        # by the server
+        self.check_server_initial_state()
+        triton_client = grpcclient.InferenceServerClient("localhost:8001")
+
+        log_settings_1 = {
+            "log_file": "log_file.log",
+            "log_info": True,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_1 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": True},
+                        "log_warning": {"boolParam": True},
+                        "log_error": {"boolParam": True},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_1,
+        )
+
+        self.assertEqual(
+            expected_log_settings_1,
+            triton_client.update_log_settings(settings=log_settings_1),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_2 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": True,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_2 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": True},
+                        "log_error": {"boolParam": True},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_2,
+        )
+
+        self.assertEqual(
+            expected_log_settings_2,
+            triton_client.update_log_settings(settings=log_settings_2),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_3 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": True,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_3 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": False},
+                        "log_error": {"boolParam": True},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_3,
+        )
+
+        self.assertEqual(
+            expected_log_settings_3,
+            triton_client.update_log_settings(settings=log_settings_3),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_4 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 0,
+            "log_format": "default",
+        }
+        expected_log_settings_4 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": False},
+                        "log_error": {"boolParam": False},
+                        "log_verbose_level": {"uint32Param": 0},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_4,
+        )
+
+        self.assertEqual(
+            expected_log_settings_4,
+            triton_client.update_log_settings(settings=log_settings_4),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_5 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 1,
+            "log_format": "default",
+        }
+        expected_log_settings_5 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": False},
+                        "log_error": {"boolParam": False},
+                        "log_verbose_level": {"uint32Param": 1},
+                        "log_format": {"stringParam": "default"},
+                    }
+                }
+            ),
+            expected_log_settings_5,
+        )
+
+        self.assertEqual(
+            expected_log_settings_5,
+            triton_client.update_log_settings(settings=log_settings_5),
+            "Unexpected updated log settings",
+        )
+
+        log_settings_6 = {
+            "log_file": "log_file.log",
+            "log_info": False,
+            "log_warning": False,
+            "log_error": False,
+            "log_verbose_level": 1,
+            "log_format": "ISO8601",
+        }
+        expected_log_settings_6 = grpcclient.service_pb2.LogSettingsResponse()
+        json_format.Parse(
+            json.dumps(
+                {
+                    "settings": {
+                        "log_file": {"stringParam": "log_file.log"},
+                        "log_info": {"boolParam": False},
+                        "log_warning": {"boolParam": False},
+                        "log_error": {"boolParam": False},
+                        "log_verbose_level": {"uint32Param": 1},
+                        "log_format": {"stringParam": "ISO8601"},
+                    }
+                }
+            ),
+            expected_log_settings_6,
+        )
+
+        self.assertEqual(
+            expected_log_settings_6,
+            triton_client.update_log_settings(settings=log_settings_6),
+            "Unexpected updated log settings",
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/qa/L0_logging/test.sh b/qa/L0_logging/test.sh
new file mode 100755
index 0000000000..d83e0b76a4
--- /dev/null
+++ b/qa/L0_logging/test.sh
@@ -0,0 +1,595 @@
+#!/bin/bash
+# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+SIMPLE_HTTP_CLIENT=../clients/simple_http_infer_client
+SIMPLE_GRPC_CLIENT=../clients/simple_grpc_infer_client
+
+CLIENT_TEST=logging_endpoint_test.py
+CLIENT_LOG="client.log"
+TEST_RESULT_FILE="test_results.txt"
+EXPECTED_NUM_TESTS="4"
+
+REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION}
+if [ "$#" -ge 1 ]; then
+    REPO_VERSION=$1
+fi
+if [ -z "$REPO_VERSION" ]; then
+    echo -e "Repository version must be specified"
+    echo -e "\n***\n*** Test Failed\n***"
+    exit 1
+fi
+if [ ! -z "$TEST_REPO_ARCH" ]; then
+    REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH}
+fi
+
+export CUDA_VISIBLE_DEVICES=0
+
+DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository
+MODELBASE=onnx_int32_int32_int32
+
+MODELSDIR=`pwd`/log_models
+
+SERVER=/opt/tritonserver/bin/tritonserver
+source ../common/util.sh
+
+rm -f *.log
+rm -fr $MODELSDIR && mkdir -p $MODELSDIR
+
+# Set up the "simple" model repository from $MODELBASE
+rm -fr $MODELSDIR && mkdir -p $MODELSDIR && \
+    cp -r $DATADIR/$MODELBASE $MODELSDIR/simple && \
+    rm -r $MODELSDIR/simple/2 && rm -r $MODELSDIR/simple/3 && \
+    (cd $MODELSDIR/simple && \
+            sed -i "s/^name:.*/name: \"simple\"/" config.pbtxt)
+RET=0
+
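+# verify_correct_settings queries the /v2/logging endpoint and compares each returned
+# field against the expected values, passed positionally as:
+# [ file | info | warn | error | verbosity | format ]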
+function verify_correct_settings () {
+  log_file_expected=$1
+  log_info_expected=$2
+  log_warn_expected=$3
+  log_error_expected=$4
+  log_verbose_expected=$5
+  log_format_expected=$6
+  code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+  if [ `grep -c "\"log_file\":\"$log_file_expected\"" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log File Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_info\":$log_info_expected" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Info Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_warning\":$log_warn_expected" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Warn Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_error\":$log_error_expected" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Error Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_verbose_level\":$log_verbose_expected" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Verbose Setting\n***"
+    RET=1
+  fi
+  if [ `grep -c "\"log_format\":\"$log_format_expected\"" ./curl.out` != "1" ]; then
+    echo -e "\n***\n*** Test Failed: Incorrect Log Format Setting\n***"
+    RET=1
+  fi
+}
+
+# Run Default Server
+SERVER_ARGS="--model-repository=$MODELSDIR"
+SERVER_LOG="./server.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+# Check Default Settings
+rm -f ./curl.out
+set +e
+
+# Check if the current settings are returned [ file | info | warn | error | verbosity | format ]
+verify_correct_settings "" "true" "true" "true" "0" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_default.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_default.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+# Check that the log streams to the console by default
+console_count=($(wc -l ./server.log))
+if [ $console_count -le 30 ]; then
+    echo -e "\n***\n*** Test Failed: Log File Error\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log File (Argument)
+SERVER_ARGS="--log-file=log_file.log --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+
+verify_correct_settings "log_file.log" "true" "true" "true" "0" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_file.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_file.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+expected_log_count=19
+actual_log_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* ./log_file.log)
+if [ $actual_log_count -lt $expected_log_count ]; then
+    echo $actual_log_count
+    echo $expected_log_count
+    echo -e "\n***\n*** Test Failed: Fewer Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+expected_server_count=0
+actual_server_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* inference_server_log_file.log)
+if [ $actual_server_count -gt $expected_server_count ]; then
+    echo $actual_server_count
+    echo $expected_server_count
+    echo -e "\n***\n*** Test Failed: More Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log File (Dynamic)
+rm -f log_file.log
+SERVER_ARGS="--log-file=log_file.log --log-verbose=1 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
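+# Redirect logging to other_log.log at runtime by POSTing the new setting to the
+# /v2/logging endpoint.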
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_file":"other_log.log"}' localhost:8000/v2/logging`
+set +e
+
+verify_correct_settings "other_log.log" "true" "true" "true" "1" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_file.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_file.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+# Check that redirection worked properly (the server log has a tolerance of 40 due to
+# unavoidable ONNX framework logging)
+expected_log_count=75
+actual_log_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* ./log_file.log)
+if [ $actual_log_count -lt $expected_log_count ]; then
+    echo $actual_log_count
+    echo $expected_log_count
+    echo -e "\n***\n*** Test Failed: Fewer Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+expected_other_log_count=31
+actual_other_log_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* ./other_log.log)
+if [ $actual_other_log_count -lt $expected_other_log_count ]; then
+    echo $actual_other_log_count
+    echo $expected_other_log_count
+    echo -e "\n***\n*** Test Failed: Fewer Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+expected_server_count=0
+actual_server_count=$(grep -c ^[IWEV][0-9][0-9][0-9][0-9].* inference_server_log_file.log)
+if [ $actual_server_count -gt $expected_server_count ]; then
+    echo $actual_server_count
+    echo $expected_server_count
+    echo -e "\n***\n*** Test Failed: More Log Messages Than Expected $LINENO\n***"
+    RET=1
+fi
+
+set -e
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log Info (Argument)
+rm -f log_file.log
+SERVER_ARGS="--log-file=log_file.log --log-info=false --log-verbose=1 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+verify_correct_settings "log_file.log" "false" "true" "true" "1" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_info.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_info.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+# Test against guaranteed info message
+count=$(grep -c "Started HTTPService at" ./log_file.log)
+if [ $count -gt 0 ]; then
+    echo -e "\n***\n*** Test Failed: Info Message Not Expected $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+# Test Log Info (Dynamic)
+set +e
+rm -f ./curl.out
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_info":true}' localhost:8000/v2/logging`
+
+verify_correct_settings "log_file.log" "true" "true" "true" "1" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_info.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_info.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+set +e
+# Test against guaranteed info message
+count=$(grep -c "Waiting for in-flight requests to complete" ./log_file.log)
+if [ $count -ne 1 ]; then
+    echo -e "\n***\n*** Test Failed: Info Message Expected $LINENO\n***"
+    RET=1
+fi
+set -e
+
+# Test Log Warning
+SERVER_ARGS="--log-warning=false --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+verify_correct_settings "" "true" "false" "true" "0" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_warning.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_warning.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log Error
+SERVER_ARGS="--log-error=false --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+# Check if the current settings are returned [ file | info | warn | error | verbosity | format ]
+verify_correct_settings "" "true" "true" "false" "0" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_error.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_error.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log Verbose Level (Argument)
+rm -f log_file.log
+SERVER_ARGS="--log-file=log_file.log --log-verbose=1 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+
+verify_correct_settings "log_file.log" "true" "true" "true" "1" "default"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_verbose.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_verbose.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+count=$(grep -c "/v2/logging" ./log_file.log)
+if [ $count -ne 2 ]; then
+    echo -e "\n***\n*** Test Failed: Verbose Message Expected $LINENO\n***"
+    RET=1
+fi
+
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":0}' localhost:8000/v2/logging`
+verify_correct_settings "log_file.log" "true" "true" "true" "0" "default"
+
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+count=$(grep -c "/v2/logging" ./log_file.log)
+if [ $count -gt 3 ]; then
+    echo -e "\n***\n*** Test Failed: Too Many Verbose Messages $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Log Format (Argument)
+rm -f log_file.log
+SERVER_ARGS="--log-file=log_file.log --log-verbose=1 --log-format=ISO8601 --model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_log_file.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+rm -f ./curl.out
+set +e
+code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/logging`
+verify_correct_settings "log_file.log" "true" "true" "true" "1" "ISO8601"
+
+$SIMPLE_HTTP_CLIENT >> client_test_log_format.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
+$SIMPLE_GRPC_CLIENT >> client_test_log_format.log 2>&1
+if [ $? -ne 0 ]; then
+    RET=1
+fi
+
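+# Default-format log lines begin with the severity letter followed by MMDD (e.g. I0601),
+# so with --log-format=ISO8601 the first token of the first log line must not match
+# that pattern.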
+line=$(head -n 1 log_file.log)
+date=$(date '+%m%d')
+final_date="I${date}"
+format_date=$(echo $line | head -n1 | awk '{print $1;}')
+if [[ $final_date == $format_date ]]; then
+    echo -e "\n***\n*** Test Failed: Unexpected Log Format $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+# Test Log Format (Dynamic)
+set +e
+rm -f ./curl.out
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_format":"default"}' localhost:8000/v2/logging`
+verify_correct_settings "log_file.log" "true" "true" "true" "1" "default"
+
+line=$(tail -n 1 log_file.log)
+date=$(date '+%m%d')
+final_date="I${date}"
+format_date=$(echo $line | head -n1 | awk '{print $1;}')
+if [[ $final_date != $format_date ]]; then
+    echo -e "\n***\n*** Test Failed: Unexpected Log Format $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Negative Cases
+SERVER_ARGS="--log-warn=false --model-repository=$MODELSDIR"
+SERVER_LOG="./server.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+BOOL_PARAMS=${BOOL_PARAMS:="log_info log_warning log_error"}
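+# Each boolean log setting must reject non-boolean JSON values with HTTP 400 and
+# accept a proper boolean with HTTP 200.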
+for BOOL_PARAM in $BOOL_PARAMS; do
+    # Attempt to use integer instead of bool
+    code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":1}' localhost:8000/v2/logging`
+    if [ "$code" != "400" ]; then
+        echo $code
+        cat ./curl.out
+        echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+        RET=1
+    fi
+    # Attempt to use upper-case bool
+    code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":False}' localhost:8000/v2/logging`
+    if [ "$code" != "400" ]; then
+        cat ./curl.out
+        echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+        RET=1
+    fi
+    # Attempt to use string bool
+    code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":"false"}' localhost:8000/v2/logging`
+    if [ "$code" != "400" ]; then
+        echo $code
+        cat ./curl.out
+        echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+        RET=1
+    fi
+    # Positive test case
+    code=`curl -s -w %{http_code} -o ./curl.out -d'{"'"$BOOL_PARAM"'":true}' localhost:8000/v2/logging`
+    if [ "$code" != "200" ]; then
+        cat ./curl.out
+        echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+        RET=1
+    fi
+done
+
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":-1}' localhost:8000/v2/logging`
+if [ "$code" != "400" ]; then
+    echo $code
+    cat ./curl.out
+    echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+    RET=1
+fi
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":"1"}' localhost:8000/v2/logging`
+if [ "$code" != "400" ]; then
+    echo $code
+    cat ./curl.out
+    echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+    RET=1
+fi
+code=`curl -s -w %{http_code} -o ./curl.out -d'{"log_verbose_level":0}' localhost:8000/v2/logging`
+if [ "$code" != "200" ]; then
+    echo $code
+    cat ./curl.out
+    echo -e "\n***\n*** Test Failed: Line: $LINENO\n***"
+    RET=1
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+# Test Python client library
+SERVER_ARGS="--model-repository=$MODELSDIR"
+SERVER_LOG="./inference_server_unittest.log"
+run_server
+if [ "$SERVER_PID" == "0" ]; then
+    echo -e "\n***\n*** Failed to start $SERVER\n***"
+    cat $SERVER_LOG
+    exit 1
+fi
+
+set +e
+
+python $CLIENT_TEST >>$CLIENT_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $CLIENT_LOG
+    RET=1
+else
+    check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS
+    if [ $? -ne 0 ]; then
+        cat $CLIENT_LOG
+        echo -e "\n***\n*** Test Result Verification Failed\n***"
+        RET=1
+    fi
+fi
+
+set -e
+
+kill $SERVER_PID
+wait $SERVER_PID
+
+
+
+if [ $RET -eq 0 ]; then
+    echo -e "\n***\n*** Test Passed\n***"
+else
+    echo -e "\n***\n*** Test FAILED\n***"
+fi
+
+
+exit $RET
diff --git a/qa/L0_long_running_stress/crashing_client.py b/qa/L0_long_running_stress/crashing_client.py
old mode 100644
new mode 100755
index 81ce2e996e..d9c727a3d3
--- a/qa/L0_long_running_stress/crashing_client.py
+++ b/qa/L0_long_running_stress/crashing_client.py
@@ -1,4 +1,6 @@
-# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,29 +27,27 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
-import numpy as np
-from multiprocessing import Process, shared_memory
+import argparse
 import time
+from multiprocessing import Process, shared_memory
+
+import numpy as np
 import test_util as tu
-import argparse
 import tritonclient.grpc as grpcclient
 from tritonclient.utils import np_to_triton_dtype
 
 
-def crashing_client(model_name,
-                    dtype,
-                    tensor_shape,
-                    shm_name,
-                    triton_client,
-                    input_name="INPUT0"):
+def crashing_client(
+    model_name, dtype, tensor_shape, shm_name, triton_client, input_name="INPUT0"
+):
     in0 = np.random.random(tensor_shape).astype(dtype)
     if "libtorch" in model_name:
         input_name = "INPUT__0"
     inputs = [
-        grpcclient.InferInput(input_name, tensor_shape,
-                              np_to_triton_dtype(dtype)),
+        grpcclient.InferInput(input_name, tensor_shape, np_to_triton_dtype(dtype)),
     ]
     inputs[0].set_data_from_numpy(in0)
 
@@ -61,13 +61,15 @@ def crashing_client(model_name,
         results = triton_client.infer(model_name, inputs)
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-t',
-                        '--trial',
-                        type=str,
-                        required=True,
-                        help='Set trial for the crashing client')
+    parser.add_argument(
+        "-t",
+        "--trial",
+        type=str,
+        required=True,
+        help="Set trial for the crashing client",
+    )
     FLAGS = parser.parse_args()
     trial = FLAGS.trial
 
@@ -75,22 +77,23 @@ def crashing_client(model_name,
     model_name = tu.get_zero_model_name(trial, 1, dtype)
     tensor_shape = (1,) if "nobatch" in trial else (1, 1)
 
-    triton_client = grpcclient.InferenceServerClient(url="localhost:8001",
-                                                     verbose=True)
+    triton_client = grpcclient.InferenceServerClient(url="localhost:8001", verbose=True)
 
     shm = shared_memory.SharedMemory(create=True, size=8)
     count = np.ndarray((1,), dtype=np.int32, buffer=shm.buf)
     count[0] = 0
 
-    p = Process(target=crashing_client,
-                name="crashing_client",
-                args=(
-                    model_name,
-                    dtype,
-                    tensor_shape,
-                    shm.name,
-                    triton_client,
-                ))
+    p = Process(
+        target=crashing_client,
+        name="crashing_client",
+        args=(
+            model_name,
+            dtype,
+            tensor_shape,
+            shm.name,
+            triton_client,
+        ),
+    )
 
     p.start()
 
diff --git a/qa/L0_long_running_stress/scenarios.py b/qa/L0_long_running_stress/scenarios.py
old mode 100644
new mode 100755
index caae4fa12e..abb0004e90
--- a/qa/L0_long_running_stress/scenarios.py
+++ b/qa/L0_long_running_stress/scenarios.py
@@ -1,4 +1,6 @@
-# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -26,28 +28,31 @@
 
 import math
 import sys
+
 sys.path.append("../common")
 
-import numpy as np
-import time
-import test_util as tu
-import tritonclient.grpc as grpcclient
-from tritonclient.utils import np_to_triton_dtype
 import math
-from PIL import Image
 import os
 import subprocess
 import threading
+import time
+
+import numpy as np
+import test_util as tu
+import tritonclient.grpc as grpcclient
+from PIL import Image
+from tritonclient.utils import np_to_triton_dtype
+
 if sys.version_info >= (3, 0):
     import queue
 else:
     import Queue as queue
-from functools import partial
 
 import abc
 import csv
 import json
 import re
+from functools import partial
 
 DEFAULT_TIMEOUT_MS = 25000
 SEQUENCE_LENGTH_MEAN = 16
@@ -65,7 +70,6 @@ def completion_callback(user_data, result, error):
 
 
 class Scenario(metaclass=abc.ABCMeta):
-
     def __init__(self, name, trials, verbose=False, out_stream=sys.stdout):
         self.name_ = name
         self.trials_ = trials
@@ -108,13 +112,15 @@ class ModelOption:
         # 'queue_latency_range_us' specifies the range where queue latency
         # reported should be, otherwise, model concurrency will be adjusted
         # within 'concurrency_range' to influence the queue latency.
-        def __init__(self,
-                     model_name,
-                     batch_size,
-                     concurrency_range,
-                     queue_latency_range_us,
-                     input_shapes=[],
-                     input_file=None):
+        def __init__(
+            self,
+            model_name,
+            batch_size,
+            concurrency_range,
+            queue_latency_range_us,
+            input_shapes=[],
+            input_file=None,
+        ):
             self.model_name_ = model_name
             self.concurrency_range_ = list(concurrency_range)
             self.batch_size_ = batch_size
@@ -124,8 +130,11 @@ def __init__(self,
 
         def run(self, name, sequence_id_range, out_stream):
             csv_file = os.path.join(
-                "csv_dir", "{}_{}_{}.csv".format(name, self.model_name_,
-                                                 self.concurrency_range_[2]))
+                "csv_dir",
+                "{}_{}_{}.csv".format(
+                    name, self.model_name_, self.concurrency_range_[2]
+                ),
+            )
 
             arg_list = [PerfAnalyzerScenario.command_]
             # Always use GRPC streaming feature to ensure requests are handled
@@ -135,8 +144,9 @@ def run(self, name, sequence_id_range, out_stream):
             arg_list += ["-b", "{}".format(self.batch_size_)]
             arg_list += [
                 "--concurrency-range",
-                "{}:{}:1".format(self.concurrency_range_[2],
-                                 self.concurrency_range_[2])
+                "{}:{}:1".format(
+                    self.concurrency_range_[2], self.concurrency_range_[2]
+                ),
             ]
             arg_list += ["-f", csv_file]
             for name, shape in self.input_shapes_:
@@ -146,43 +156,44 @@ def run(self, name, sequence_id_range, out_stream):
             if sequence_id_range is not None:
                 arg_list += [
                     "--sequence-id-range",
-                    "{}:{}".format(sequence_id_range[0], sequence_id_range[1])
+                    "{}:{}".format(sequence_id_range[0], sequence_id_range[1]),
                 ]
 
-            completed_process = subprocess.run(arg_list,
-                                               text=True,
-                                               stdout=subprocess.PIPE,
-                                               stderr=subprocess.STDOUT)
+            completed_process = subprocess.run(
+                arg_list, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
+            )
             # Write output to file before checking return code
             print(completed_process.stdout, file=out_stream)
             completed_process.check_returncode()
 
             # Read queue time and adjust concurrency
-            with open(csv_file, newline='') as csvfile:
+            with open(csv_file, newline="") as csvfile:
                 reader = csv.DictReader(csvfile)
                 for row in reader:
-                    current_queue_us = int(row['Server Queue'])
+                    current_queue_us = int(row["Server Queue"])
                     if current_queue_us < self.queue_latency_range_us_[0]:
                         self.concurrency_range_[2] = min(
-                            self.concurrency_range_[2] + 1,
-                            self.concurrency_range_[1])
+                            self.concurrency_range_[2] + 1, self.concurrency_range_[1]
+                        )
                     elif current_queue_us > self.queue_latency_range_us_[0]:
                         self.concurrency_range_[2] = max(
-                            self.concurrency_range_[2] - 1,
-                            self.concurrency_range_[0])
+                            self.concurrency_range_[2] - 1, self.concurrency_range_[0]
+                        )
                     break
-            m = re.search(r'Request count: ([0-9]+)', completed_process.stdout)
+            m = re.search(r"Request count: ([0-9]+)", completed_process.stdout)
             return int(m.group(1))
 
-    def __init__(self,
-                 name,
-                 rng,
-                 sequence_trials,
-                 identity_trials,
-                 queue_latency_range_us=(10000, 100000),
-                 sequence_id_range=None,
-                 verbose=False,
-                 out_stream=sys.stdout):
+    def __init__(
+        self,
+        name,
+        rng,
+        sequence_trials,
+        identity_trials,
+        queue_latency_range_us=(10000, 100000),
+        sequence_id_range=None,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
         super().__init__(name, [], verbose, out_stream)
         self.rng_ = rng
         self.sequence_id_range_ = sequence_id_range
@@ -193,8 +204,10 @@ def __init__(self,
 
         # Add no validation models
         self.options_.append(
-            PerfAnalyzerScenario.ModelOption("resnet_v1_50_graphdef_def", 32,
-                                             (1, 4, 1), queue_latency_range_us))
+            PerfAnalyzerScenario.ModelOption(
+                "resnet_v1_50_graphdef_def", 32, (1, 4, 1), queue_latency_range_us
+            )
+        )
         for trial in sequence_trials:
             dtype = self.get_datatype(trial)
             # Skip string sequence model for now, it is hard for PA to generate
@@ -203,8 +216,10 @@ def __init__(self,
                 continue
             model_name = tu.get_sequence_model_name(trial, dtype)
             self.options_.append(
-                PerfAnalyzerScenario.ModelOption(model_name, 1, (1, 4, 1),
-                                                 queue_latency_range_us))
+                PerfAnalyzerScenario.ModelOption(
+                    model_name, 1, (1, 4, 1), queue_latency_range_us
+                )
+            )
         for trial in identity_trials:
             dtype = np.float32
             model_name = tu.get_zero_model_name(trial, 1, dtype)
@@ -213,9 +228,10 @@ def __init__(self,
             else:
                 input_shapes = [("INPUT0", "16")]
             self.options_.append(
-                PerfAnalyzerScenario.ModelOption(model_name, 1, (1, 4, 1),
-                                                 queue_latency_range_us,
-                                                 input_shapes))
+                PerfAnalyzerScenario.ModelOption(
+                    model_name, 1, (1, 4, 1), queue_latency_range_us, input_shapes
+                )
+            )
 
         # Add output validation version of the models
         # Skip resnet as the output data has variation which makes exact
@@ -223,25 +239,31 @@ def __init__(self,
         for trial in sequence_trials:
             dtype = self.get_datatype(trial)
             model_name = tu.get_sequence_model_name(trial, dtype)
-            data_file = os.path.join("validation_data",
-                                     "{}.json".format(model_name))
+            data_file = os.path.join("validation_data", "{}.json".format(model_name))
             self.generate_sequence_data(trial, dtype, data_file)
             self.options_.append(
-                PerfAnalyzerScenario.ModelOption(model_name,
-                                                 1, (1, 4, 1),
-                                                 queue_latency_range_us,
-                                                 input_file=data_file))
+                PerfAnalyzerScenario.ModelOption(
+                    model_name,
+                    1,
+                    (1, 4, 1),
+                    queue_latency_range_us,
+                    input_file=data_file,
+                )
+            )
         for trial in identity_trials:
             dtype = np.float32
             model_name = tu.get_zero_model_name(trial, 1, dtype)
-            data_file = os.path.join("validation_data",
-                                     "{}.json".format(model_name))
+            data_file = os.path.join("validation_data", "{}.json".format(model_name))
             self.generate_identity_data(trial, dtype, data_file)
             self.options_.append(
-                PerfAnalyzerScenario.ModelOption(model_name,
-                                                 1, (1, 4, 1),
-                                                 queue_latency_range_us,
-                                                 input_file=data_file))
+                PerfAnalyzerScenario.ModelOption(
+                    model_name,
+                    1,
+                    (1, 4, 1),
+                    queue_latency_range_us,
+                    input_file=data_file,
+                )
+            )
 
     def generate_sequence_data(self, trial, dtype, data_filename):
         input0 = "INPUT" if "libtorch" not in trial else "INPUT__0"
@@ -254,8 +276,7 @@ def generate_sequence_data(self, trial, dtype, data_filename):
             elif dtype == np.dtype(object):
                 res = str(i)
             else:
-                raise Exception(
-                    "unexpected sequence data type {}".format(dtype))
+                raise Exception("unexpected sequence data type {}".format(dtype))
             input_data.append({input0: [res]})
         output0 = "OUTPUT" if "libtorch" not in trial else "OUTPUT__0"
         output_data = []
@@ -271,8 +292,7 @@ def generate_sequence_data(self, trial, dtype, data_filename):
                 elif dtype == np.dtype(object):
                     res = str(sum)
                 else:
-                    raise Exception(
-                        "unexpected sequence data type {}".format(dtype))
+                    raise Exception("unexpected sequence data type {}".format(dtype))
                 output_data.append({output0: [res]})
         else:
             for i in range(3):
@@ -284,17 +304,17 @@ def generate_sequence_data(self, trial, dtype, data_filename):
                 elif dtype == np.dtype(object):
                     res = str(res)
                 else:
-                    raise Exception(
-                        "unexpected sequence data type {}".format(dtype))
+                    raise Exception("unexpected sequence data type {}".format(dtype))
                 output_data.append(
-                    {output0: [res if dtype != np.dtype(object) else str(res)]})
+                    {output0: [res if dtype != np.dtype(object) else str(res)]}
+                )
         data = {"data": [input_data]}
         data["validation_data"] = [output_data]
 
         # Only write to a file if there isn't validation file for the model
         PerfAnalyzerScenario.generation_mutex_.acquire()
         if not os.path.exists(data_filename):
-            with open(data_filename, 'w') as f:
+            with open(data_filename, "w") as f:
                 json.dump(data, f)
         PerfAnalyzerScenario.generation_mutex_.release()
 
@@ -310,43 +330,26 @@ def generate_identity_data(self, trial, dtype, data_filename):
             elif dtype == np.dtype(object):
                 res = str(i)
             else:
-                raise Exception(
-                    "unexpected identity data type {}".format(dtype))
+                raise Exception("unexpected identity data type {}".format(dtype))
             io_data.append(res)
         data = {
-            "data": [{
-                input0: {
-                    "content": io_data,
-                    "shape": [16]
-                }
-            }],
-            "validation_data": [{
-                output0: {
-                    "content": io_data,
-                    "shape": [16]
-                }
-            }]
+            "data": [{input0: {"content": io_data, "shape": [16]}}],
+            "validation_data": [{output0: {"content": io_data, "shape": [16]}}],
         }
         # Only write to a file if there isn't validation file for the model
         PerfAnalyzerScenario.generation_mutex_.acquire()
         if not os.path.exists(data_filename):
-            with open(data_filename, 'w') as f:
+            with open(data_filename, "w") as f:
                 json.dump(data, f)
         PerfAnalyzerScenario.generation_mutex_.release()
 
     def run(self, client_metadata):
         model_option = np.random.choice(self.options_)
-        return model_option.run(self.name_, self.sequence_id_range_,
-                                self.out_stream_)
+        return model_option.run(self.name_, self.sequence_id_range_, self.out_stream_)
 
 
 class ResNetScenario(Scenario):
-
-    def __init__(self,
-                 name,
-                 batch_size=32,
-                 verbose=False,
-                 out_stream=sys.stdout):
+    def __init__(self, name, batch_size=32, verbose=False, out_stream=sys.stdout):
         super().__init__(name, [], verbose, out_stream)
         self.model_name_ = "resnet_v1_50_graphdef_def"
         self.batch_size_ = batch_size
@@ -359,7 +362,7 @@ def __init__(self,
 
     def preprocess(self, filename):
         img = Image.open(filename)
-        resized_img = img.convert('RGB').resize((224, 224), Image.BILINEAR)
+        resized_img = img.convert("RGB").resize((224, 224), Image.BILINEAR)
         np_img = np.array(resized_img).astype(np.float32)
         if np_img.ndim == 2:
             np_img = np_img[:, :, np.newaxis]
@@ -369,31 +372,35 @@ def preprocess(self, filename):
     def postprocess(self, results):
         output_array = results.as_numpy("resnet_v1_50/predictions/Softmax")
         if len(output_array) != self.batch_size_:
-            raise Exception("expected {} results, got {}".format(
-                self.batch_size_, len(output_array)))
+            raise Exception(
+                "expected {} results, got {}".format(
+                    self.batch_size_, len(output_array)
+                )
+            )
 
         for results in output_array:
             for result in results:
                 if output_array.dtype.type == np.object_:
-                    cls = "".join(chr(x) for x in result).split(':')
+                    cls = "".join(chr(x) for x in result).split(":")
                 else:
-                    cls = result.split(':')
+                    cls = result.split(":")
                 if cls[2] != "VULTURE":
                     raise Exception(
-                        "expected VULTURE as classification result, got {}".
-                        format(cls[2]))
+                        "expected VULTURE as classification result, got {}".format(
+                            cls[2]
+                        )
+                    )
 
     def run(self, client_metadata):
         triton_client = client_metadata[0]
 
-        inputs = [
-            grpcclient.InferInput("input", self.image_data_.shape, "FP32")
-        ]
+        inputs = [grpcclient.InferInput("input", self.image_data_.shape, "FP32")]
         inputs[0].set_data_from_numpy(self.image_data_)
 
         outputs = [
-            grpcclient.InferRequestedOutput("resnet_v1_50/predictions/Softmax",
-                                            class_count=1)
+            grpcclient.InferRequestedOutput(
+                "resnet_v1_50/predictions/Softmax", class_count=1
+            )
         ]
         res = triton_client.infer(self.model_name_, inputs, outputs=outputs)
         self.postprocess(res)
@@ -401,14 +408,15 @@ def run(self, client_metadata):
 
 
 class TimeoutScenario(Scenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 input_dtype=np.float32,
-                 input_name="INPUT0",
-                 verbose=False,
-                 out_stream=sys.stdout):
+    def __init__(
+        self,
+        name,
+        trials,
+        input_dtype=np.float32,
+        input_name="INPUT0",
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
         super().__init__(name, trials, verbose, out_stream)
         self.input_dtype_ = input_dtype
         self.input_name_ = input_name
@@ -421,12 +429,16 @@ def run(self, client_metadata):
         if "librotch" in trial:
             input_name = "INPUT__0"
 
-        tensor_shape = (math.trunc(1 * (1024 * 1024 * 1024) //
-                                   np.dtype(self.input_dtype_).itemsize),)
+        tensor_shape = (
+            math.trunc(
+                1 * (1024 * 1024 * 1024) // np.dtype(self.input_dtype_).itemsize
+            ),
+        )
         in0 = np.random.random(tensor_shape).astype(self.input_dtype_)
         inputs = [
-            grpcclient.InferInput(input_name, tensor_shape,
-                                  np_to_triton_dtype(self.input_dtype_)),
+            grpcclient.InferInput(
+                input_name, tensor_shape, np_to_triton_dtype(self.input_dtype_)
+            ),
         ]
         inputs[0].set_data_from_numpy(in0)
 
@@ -442,12 +454,11 @@ def run(self, client_metadata):
 
 
 class CrashingScenario(Scenario):
-
     def __init__(self, name, verbose=False, out_stream=sys.stdout):
         super().__init__(name, [], verbose, out_stream)
 
     def run(self, client_metadata):
-        # Only use "custom" model as it simulates exectuion delay which
+        # Only use "custom" model as it simulates execution delay which
         # simplifies "crashing simulation" (client exits while request is being
         # executed)
         trial = "custom"
@@ -455,8 +466,7 @@ def run(self, client_metadata):
         # Call the client as subprocess to avoid crashing stress test
         # and gather logging as string variable
         crashing_client = "crashing_client.py"
-        log = subprocess.check_output(
-            [sys.executable, crashing_client, "-t", trial])
+        log = subprocess.check_output([sys.executable, crashing_client, "-t", trial])
         result = self.parse_result(log.decode("utf-8"))
         if not result[1]:
             assert False, "crashing_client failed {}".format(self.name_)
@@ -471,22 +481,20 @@ def parse_result(self, log):
         if "request_count:" in log:
             idx_start = log.rindex("request_count:")
             idx_start = log.find(" ", idx_start)
-            idx_end = log.find('\n', idx_start)
-            request_count = int(log[idx_start + 1:idx_end])
+            idx_end = log.find("\n", idx_start)
+            request_count = int(log[idx_start + 1 : idx_end])
 
         if "live:" in log:
             idx_start = log.rindex("live:")
             idx_start = log.find(" ", idx_start)
-            idx_end = log.find('\n', idx_start)
-            is_server_live = log[idx_start + 1:idx_end]
+            idx_end = log.find("\n", idx_start)
+            is_server_live = log[idx_start + 1 : idx_end]
 
         return (request_count, is_server_live == "true")
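
As a rough illustration of what parse_result() expects, here is a self-contained sketch with a made-up log snippet; the exact output format of crashing_client.py is an assumption here, inferred only from the parsing logic above:

    # Made-up log text shaped like what the parser above looks for.
    sample_log = "client exiting...\nrequest_count: 42\nlive: true\n"

    idx_start = sample_log.rindex("request_count:")
    idx_start = sample_log.find(" ", idx_start)
    idx_end = sample_log.find("\n", idx_start)
    assert int(sample_log[idx_start + 1 : idx_end]) == 42

    idx_start = sample_log.rindex("live:")
    idx_start = sample_log.find(" ", idx_start)
    idx_end = sample_log.find("\n", idx_start)
    assert sample_log[idx_start + 1 : idx_end] == "true"
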
 
 
 class SequenceScenario(Scenario):
-
     class UserData:
-
         def __init__(self):
             self._completed_requests = queue.Queue()
 
@@ -497,51 +505,63 @@ def __init__(self):
     def check_constraints(self, model_name, sequence_id):
         pass
 
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
         super().__init__(name, trials, verbose, out_stream)
         self.rng_ = rng
         self.sequence_constraints_ = sequence_constraints
 
     def get_expected_result(self, expected_result, value, trial, flag_str=None):
         # Adjust the expected_result for models that
-        # couldn't implement the full accumulator. See
+        # could not implement the full accumulator. See
         # qa/common/gen_qa_sequence_models.py for more
         # information.
-        if (("nobatch" not in trial and
-             ("custom" not in trial)) or ("graphdef" in trial) or
-            ("plan" in trial) or ("onnx" in trial)) or ("libtorch" in trial):
+        if (
+            ("nobatch" not in trial and ("custom" not in trial))
+            or ("graphdef" in trial)
+            or ("plan" in trial)
+            or ("onnx" in trial)
+        ) or ("libtorch" in trial):
             expected_result = value
             if (flag_str is not None) and ("start" in flag_str):
                 expected_result += 1
         return expected_result
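
To make the branch above concrete, here is a small, self-contained trace of how the expected value evolves over a three-step sequence; the trial names and values are illustrative only, and the helper simply mirrors the run()/get_expected_result() interplay shown in this file:

    def expected_for(trial, values):
        # Mirrors: run() accumulates the value, then get_expected_result()
        # overrides the accumulated sum for trials matching the condition above.
        expected, out = 0, []
        for idx, val in enumerate(values):
            flag_str = "start" if idx == 0 else None
            expected += val
            if (
                ("nobatch" not in trial and ("custom" not in trial))
                or ("graphdef" in trial)
                or ("plan" in trial)
                or ("onnx" in trial)
            ) or ("libtorch" in trial):
                expected = val
                if (flag_str is not None) and ("start" in flag_str):
                    expected += 1
            out.append(expected)
        return out

    # Illustrative trials: one that matches the condition, one that does not.
    assert expected_for("onnx", [3, 5, 7]) == [4, 5, 7]
    assert expected_for("custom_nobatch", [3, 5, 7]) == [3, 8, 15]
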
 
-    def check_sequence_async(self,
-                             client_metadata,
-                             trial,
-                             model_name,
-                             input_dtype,
-                             steps,
-                             timeout_ms=DEFAULT_TIMEOUT_MS,
-                             batch_size=1,
-                             sequence_name="",
-                             tensor_shape=(1,),
-                             input_name="INPUT",
-                             output_name="OUTPUT"):
+    def check_sequence_async(
+        self,
+        client_metadata,
+        trial,
+        model_name,
+        input_dtype,
+        steps,
+        timeout_ms=DEFAULT_TIMEOUT_MS,
+        batch_size=1,
+        sequence_name="",
+        tensor_shape=(1,),
+        input_name="INPUT",
+        output_name="OUTPUT",
+    ):
         """Perform sequence of inferences using async run. The 'steps' holds
         a list of tuples, one for each inference with format:
 
         (flag_str, value, expected_result, delay_ms)
 
         """
-        if (("savedmodel" not in trial) and ("graphdef" not in trial) and
-            ("custom" not in trial) and ("onnx" not in trial) and
-            ("libtorch" not in trial) and ("plan" not in trial)):
+        if (
+            ("savedmodel" not in trial)
+            and ("graphdef" not in trial)
+            and ("custom" not in trial)
+            and ("onnx" not in trial)
+            and ("libtorch" not in trial)
+            and ("plan" not in trial)
+        ):
             assert False, "unknown trial type: " + trial
 
         if "nobatch" not in trial:
@@ -565,28 +585,30 @@ def check_sequence_async(self,
             seq_start = False
             seq_end = False
             if flag_str is not None:
-                seq_start = ("start" in flag_str)
-                seq_end = ("end" in flag_str)
+                seq_start = "start" in flag_str
+                seq_end = "end" in flag_str
 
             if input_dtype == np.object_:
                 in0 = np.full(tensor_shape, value, dtype=np.int32)
-                in0n = np.array([str(x) for x in in0.reshape(in0.size)],
-                                dtype=object)
+                in0n = np.array([str(x) for x in in0.reshape(in0.size)], dtype=object)
                 in0 = in0n.reshape(tensor_shape)
             else:
                 in0 = np.full(tensor_shape, value, dtype=input_dtype)
 
             inputs = [
-                grpcclient.InferInput(input_name, tensor_shape,
-                                      np_to_triton_dtype(input_dtype)),
+                grpcclient.InferInput(
+                    input_name, tensor_shape, np_to_triton_dtype(input_dtype)
+                ),
             ]
             inputs[0].set_data_from_numpy(in0)
 
-            triton_client.async_stream_infer(model_name,
-                                             inputs,
-                                             sequence_id=sequence_id,
-                                             sequence_start=seq_start,
-                                             sequence_end=seq_end)
+            triton_client.async_stream_infer(
+                model_name,
+                inputs,
+                sequence_id=sequence_id,
+                sequence_start=seq_start,
+                sequence_end=seq_end,
+            )
             sent_count += 1
 
             if delay_ms is not None:
@@ -607,49 +629,62 @@ def check_sequence_async(self,
                 if (now_ms - seq_start_ms) > timeout_ms:
                     raise TimeoutException(
                         "Timeout expired for {}, got {} ms".format(
-                            sequence_name, (now_ms - seq_start_ms)))
-
-            result = results.as_numpy(
-                output_name)[0] if "nobatch" in trial else results.as_numpy(
-                    output_name)[0][0]
+                            sequence_name, (now_ms - seq_start_ms)
+                        )
+                    )
+
+            result = (
+                results.as_numpy(output_name)[0]
+                if "nobatch" in trial
+                else results.as_numpy(output_name)[0][0]
+            )
             if self.verbose_:
-                print("{} {}: + {} = {}".format(sequence_name, sequence_id,
-                                                value, result),
-                      file=self.out_stream_)
+                print(
+                    "{} {}: + {} = {}".format(
+                        sequence_name, sequence_id, value, result
+                    ),
+                    file=self.out_stream_,
+                )
 
             if expected is not None:
                 if input_dtype == np.object_:
-                    assert int(
-                        result
-                    ) == expected, "{}: expected result {}, got {} {} {}".format(
-                        sequence_name, expected, int(result), trial, model_name)
+                    assert (
+                        int(result) == expected
+                    ), "{}: expected result {}, got {} {} {}".format(
+                        sequence_name, expected, int(result), trial, model_name
+                    )
                 else:
-                    assert result == expected, "{}: expected result {}, got {} {} {}".format(
-                        sequence_name, expected, result, trial, model_name)
+                    assert (
+                        result == expected
+                    ), "{}: expected result {}, got {} {} {}".format(
+                        sequence_name, expected, result, trial, model_name
+                    )
         triton_client.stop_stream()
         return sent_count
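
A minimal, hypothetical wiring of check_sequence_async() outside the scenario classes, to show the shape of the 'steps' tuples; it assumes a running Triton server with a matching sequence model on localhost:8001, and the model name and correlation ID below are placeholders:

    import numpy as np
    import tritonclient.grpc as grpcclient
    from scenarios import SequenceValidScenario

    client = grpcclient.InferenceServerClient("localhost:8001")
    scenario = SequenceValidScenario("example", ("onnx",), np.random.RandomState(0), {})

    steps = [
        # (flag_str, value, expected_result, delay_ms); a None expected_result
        # skips the value check inside check_sequence_async().
        ("start", 1, None, None),
        (None, 2, None, None),
        ("end", 4, None, None),
    ]
    sent = scenario.check_sequence_async(
        (client, 1001),         # client_metadata: (client, correlation id)
        "onnx",                 # trial type; must name one of the known backends
        "onnx_sequence_int32",  # placeholder model name
        np.int32,
        steps,
        sequence_name=scenario.scenario_name(),
    )
    # sent holds the number of requests issued (3 here).
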
 
 
 class SequenceNoEndScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # The scenario can always be run regardless of the previous runs
         return True
 
-    def run(self,
-            client_metadata,
-            len_mean=SEQUENCE_LENGTH_MEAN,
-            len_stddev=SEQUENCE_LENGTH_STDEV):
+    def run(
+        self,
+        client_metadata,
+        len_mean=SEQUENCE_LENGTH_MEAN,
+        len_stddev=SEQUENCE_LENGTH_STDEV,
+    ):
         trial = self.get_trial()
         dtype = self.get_datatype(trial)
         model_name = tu.get_sequence_model_name(trial, dtype)
@@ -665,9 +700,10 @@ def run(self,
         # never ends. The sequence should be aborted by the server and its
         # slot reused for another sequence.
         seqlen = max(1, int(self.rng_.normal(len_mean, len_stddev)))
-        print("{} {}: no-end seqlen = {}".format(self.name_, client_metadata[1],
-                                                 seqlen),
-              file=self.out_stream_)
+        print(
+            "{} {}: no-end seqlen = {}".format(self.name_, client_metadata[1], seqlen),
+            file=self.out_stream_,
+        )
 
         values = self.rng_.randint(0, 1024 * 1024, size=seqlen).astype(dtype)
 
@@ -682,40 +718,42 @@ def run(self,
             val = values[idx]
             delay_ms = None
             expected_result += val
-            expected_result = self.get_expected_result(expected_result, val,
-                                                       trial, flags)
+            expected_result = self.get_expected_result(
+                expected_result, val, trial, flags
+            )
 
             # (flag_str, value, expected_result, delay_ms)
-            steps.append((flags, val, expected_result, delay_ms),)
+            steps.append(
+                (flags, val, expected_result, delay_ms),
+            )
 
-        return self.check_sequence_async(client_metadata,
-                                         trial,
-                                         model_name,
-                                         dtype,
-                                         steps,
-                                         sequence_name=self.name_)
+        return self.check_sequence_async(
+            client_metadata, trial, model_name, dtype, steps, sequence_name=self.name_
+        )
 
 
 class SequenceValidNoEndScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # The scenario can always be run regardless of the previous runs
         return True
 
-    def run(self,
-            client_metadata,
-            len_mean=SEQUENCE_LENGTH_MEAN,
-            len_stddev=SEQUENCE_LENGTH_STDEV):
+    def run(
+        self,
+        client_metadata,
+        len_mean=SEQUENCE_LENGTH_MEAN,
+        len_stddev=SEQUENCE_LENGTH_STDEV,
+    ):
         trial = self.get_trial()
         dtype = self.get_datatype(trial)
         model_name = tu.get_sequence_model_name(trial, dtype)
@@ -732,15 +770,18 @@ def run(self,
         # sequences use the same correlation ID and are sent back-to-back.
         seqlen = [
             max(1, int(self.rng_.normal(len_mean, len_stddev))),
-            max(1, int(self.rng_.normal(len_mean, len_stddev)))
+            max(1, int(self.rng_.normal(len_mean, len_stddev))),
         ]
-        print("{} {}: valid-no-end seqlen[0] = {}, seqlen[1] = {}".format(
-            self.name_, client_metadata[1], seqlen[0], seqlen[1]),
-              file=self.out_stream_)
+        print(
+            "{} {}: valid-no-end seqlen[0] = {}, seqlen[1] = {}".format(
+                self.name_, client_metadata[1], seqlen[0], seqlen[1]
+            ),
+            file=self.out_stream_,
+        )
 
         values = [
             self.rng_.randint(0, 1024 * 1024, size=seqlen[0]).astype(dtype),
-            self.rng_.randint(0, 1024 * 1024, size=seqlen[1]).astype(dtype)
+            self.rng_.randint(0, 1024 * 1024, size=seqlen[1]).astype(dtype),
         ]
 
         for p in [0, 1]:
@@ -758,39 +799,41 @@ def run(self,
                 delay_ms = None
                 expected_result += val
                 expected_result = self.get_expected_result(
-                    expected_result, val, trial, flags)
+                    expected_result, val, trial, flags
+                )
 
                 # (flag_str, value, expected_result, delay_ms)
-                steps.append((flags, val, expected_result, delay_ms),)
+                steps.append(
+                    (flags, val, expected_result, delay_ms),
+                )
 
-        return self.check_sequence_async(client_metadata,
-                                         trial,
-                                         model_name,
-                                         dtype,
-                                         steps,
-                                         sequence_name=self.name_)
+        return self.check_sequence_async(
+            client_metadata, trial, model_name, dtype, steps, sequence_name=self.name_
+        )
 
 
 class SequenceValidValidScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # The scenario can always be run regardless of the previous runs
         return True
 
-    def run(self,
-            client_metadata,
-            len_mean=SEQUENCE_LENGTH_MEAN,
-            len_stddev=SEQUENCE_LENGTH_STDEV):
+    def run(
+        self,
+        client_metadata,
+        len_mean=SEQUENCE_LENGTH_MEAN,
+        len_stddev=SEQUENCE_LENGTH_STDEV,
+    ):
         trial = self.get_trial()
         dtype = self.get_datatype(trial)
         model_name = tu.get_sequence_model_name(trial, dtype)
@@ -807,15 +850,18 @@ def run(self,
         # sent back-to-back.
         seqlen = [
             max(1, int(self.rng_.normal(len_mean, len_stddev))),
-            max(1, int(self.rng_.normal(len_mean, len_stddev)))
+            max(1, int(self.rng_.normal(len_mean, len_stddev))),
         ]
-        print("{} {}: valid-valid seqlen[0] = {}, seqlen[1] = {}".format(
-            self.name_, client_metadata[1], seqlen[0], seqlen[1]),
-              file=self.out_stream_)
+        print(
+            "{} {}: valid-valid seqlen[0] = {}, seqlen[1] = {}".format(
+                self.name_, client_metadata[1], seqlen[0], seqlen[1]
+            ),
+            file=self.out_stream_,
+        )
 
         values = [
             self.rng_.randint(0, 1024 * 1024, size=seqlen[0]).astype(dtype),
-            self.rng_.randint(0, 1024 * 1024, size=seqlen[1]).astype(dtype)
+            self.rng_.randint(0, 1024 * 1024, size=seqlen[1]).astype(dtype),
         ]
 
         for p in [0, 1]:
@@ -833,30 +879,30 @@ def run(self,
                 delay_ms = None
                 expected_result += val
                 expected_result = self.get_expected_result(
-                    expected_result, val, trial, flags)
+                    expected_result, val, trial, flags
+                )
 
                 # (flag_str, value, expected_result, delay_ms)
-                steps.append((flags, val, expected_result, delay_ms),)
+                steps.append(
+                    (flags, val, expected_result, delay_ms),
+                )
 
-        return self.check_sequence_async(client_metadata,
-                                         trial,
-                                         model_name,
-                                         dtype,
-                                         steps,
-                                         sequence_name=self.name_)
+        return self.check_sequence_async(
+            client_metadata, trial, model_name, dtype, steps, sequence_name=self.name_
+        )
 
 
 class SequenceNoStartScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # no-start cannot follow no-end since the server will
@@ -864,7 +910,8 @@ def check_constraints(self, model_name, sequence_id):
         # the no-end sequence instead of being a sequence
         # missing start flag.
         if (model_name in self.sequence_constraints_) and (
-                sequence_id in self.sequence_constraints_[model_name]):
+            sequence_id in self.sequence_constraints_[model_name]
+        ):
             return not self.sequence_constraints_[model_name][sequence_id]
         return True
 
@@ -883,9 +930,12 @@ def run(self, client_metadata):
         # Create a sequence without a "start" flag. Sequence should get an
         # error from the server.
         seqlen = 1
-        print("{} {}: no-start seqlen = {}".format(self.name_,
-                                                   client_metadata[1], seqlen),
-              file=self.out_stream_)
+        print(
+            "{} {}: no-start seqlen = {}".format(
+                self.name_, client_metadata[1], seqlen
+            ),
+            file=self.out_stream_,
+        )
 
         values = self.rng_.randint(0, 1024 * 1024, size=seqlen).astype(dtype)
 
@@ -897,11 +947,12 @@ def run(self, client_metadata):
             delay_ms = None
 
             # (flag_str, value, expected_result, delay_ms)
-            steps.append((flags, val, None, delay_ms),)
+            steps.append(
+                (flags, val, None, delay_ms),
+            )
 
         try:
-            self.check_sequence_async(client_metadata, trial, model_name, dtype,
-                                      steps)
+            self.check_sequence_async(client_metadata, trial, model_name, dtype, steps)
             # Reaching this point means a no-start sequence was sent to a
             # sequence id that was used for a no-end sequence, which means the
             # constraints check above is inaccurate
@@ -914,25 +965,27 @@ def run(self, client_metadata):
 
 
 class SequenceValidScenario(SequenceScenario):
-
-    def __init__(self,
-                 name,
-                 trials,
-                 rng,
-                 sequence_constraints,
-                 verbose=False,
-                 out_stream=sys.stdout):
-        super().__init__(name, trials, rng, sequence_constraints, verbose,
-                         out_stream)
+    def __init__(
+        self,
+        name,
+        trials,
+        rng,
+        sequence_constraints,
+        verbose=False,
+        out_stream=sys.stdout,
+    ):
+        super().__init__(name, trials, rng, sequence_constraints, verbose, out_stream)
 
     def check_constraints(self, model_name, sequence_id):
         # The scenario can always be run regardless of the previous runs
         return True
 
-    def run(self,
-            client_metadata,
-            len_mean=SEQUENCE_LENGTH_MEAN,
-            len_stddev=SEQUENCE_LENGTH_STDEV):
+    def run(
+        self,
+        client_metadata,
+        len_mean=SEQUENCE_LENGTH_MEAN,
+        len_stddev=SEQUENCE_LENGTH_STDEV,
+    ):
         trial = self.get_trial()
         dtype = self.get_datatype(trial)
         model_name = tu.get_sequence_model_name(trial, dtype)
@@ -946,9 +999,10 @@ def run(self,
 
         # Create a variable length sequence with "start" and "end" flags.
         seqlen = max(1, int(self.rng_.normal(len_mean, len_stddev)))
-        print("{} {}: valid seqlen = {}".format(self.name_, client_metadata[1],
-                                                seqlen),
-              file=self.out_stream_)
+        print(
+            "{} {}: valid seqlen = {}".format(self.name_, client_metadata[1], seqlen),
+            file=self.out_stream_,
+        )
 
         values = self.rng_.randint(0, 1024 * 1024, size=seqlen).astype(dtype)
 
@@ -965,15 +1019,15 @@ def run(self,
             val = values[idx]
             delay_ms = None
             expected_result += val
-            expected_result = self.get_expected_result(expected_result, val,
-                                                       trial, flags)
+            expected_result = self.get_expected_result(
+                expected_result, val, trial, flags
+            )
 
             # (flag_str, value, expected_result, delay_ms)
-            steps.append((flags, val, expected_result, delay_ms),)
-
-        return self.check_sequence_async(client_metadata,
-                                         trial,
-                                         model_name,
-                                         dtype,
-                                         steps,
-                                         sequence_name=self.name_)
+            steps.append(
+                (flags, val, expected_result, delay_ms),
+            )
+
+        return self.check_sequence_async(
+            client_metadata, trial, model_name, dtype, steps, sequence_name=self.name_
+        )
diff --git a/qa/L0_long_running_stress/stress.py b/qa/L0_long_running_stress/stress.py
old mode 100644
new mode 100755
index 0e52a5edbe..978f204ee6
--- a/qa/L0_long_running_stress/stress.py
+++ b/qa/L0_long_running_stress/stress.py
@@ -1,4 +1,6 @@
-# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#!/usr/bin/env python3
+
+# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -27,24 +29,25 @@
 import sys
 
 from scenarios import *
+
 sys.path.append("../common")
 
 import argparse
 import bisect
-from builtins import range
-from builtins import str
 import os
-import time
 import threading
+import time
 import traceback
-import numpy as np
+from builtins import range, str
 from functools import partial
-import tritonclient.grpc as grpcclient
+
+import numpy as np
 import prettytable
+import tritonclient.grpc as grpcclient
 
 FLAGS = None
 CORRELATION_ID_BLOCK_SIZE = 1024 * 1024
-BACKENDS = os.environ.get('BACKENDS', "graphdef savedmodel onnx plan")
+BACKENDS = os.environ.get("BACKENDS", "graphdef savedmodel onnx plan")
 
 _thread_exceptions = []
 _thread_exceptions_mutex = threading.Lock()
@@ -62,24 +65,26 @@
 def get_trials(is_sequence=True):
     _trials = ()
     if is_sequence:
-        for backend in BACKENDS.split(' '):
-            if (backend != "libtorch") and (backend != 'savedmodel'):
+        for backend in BACKENDS.split(" "):
+            if (backend != "libtorch") and (backend != "savedmodel"):
                 _trials += (backend + "_nobatch",)
             _trials += (backend,)
     else:
         _trials = ()
-        for backend in BACKENDS.split(' '):
-            if (backend != "libtorch"):
+        for backend in BACKENDS.split(" "):
+            if backend != "libtorch":
                 _trials += (backend + "_nobatch",)
     return _trials
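
For the default BACKENDS value above, the trial tuples work out as follows; a quick self-contained trace (the helper is duplicated here only so the snippet runs on its own):

    # Quick trace of get_trials() with the default BACKENDS shown above.
    BACKENDS = "graphdef savedmodel onnx plan"

    def get_trials(is_sequence=True):
        _trials = ()
        if is_sequence:
            for backend in BACKENDS.split(" "):
                if (backend != "libtorch") and (backend != "savedmodel"):
                    _trials += (backend + "_nobatch",)
                _trials += (backend,)
        else:
            _trials = ()
            for backend in BACKENDS.split(" "):
                if backend != "libtorch":
                    _trials += (backend + "_nobatch",)
        return _trials

    assert get_trials(True) == (
        "graphdef_nobatch", "graphdef", "savedmodel",
        "onnx_nobatch", "onnx", "plan_nobatch", "plan",
    )
    assert get_trials(False) == (
        "graphdef_nobatch", "savedmodel_nobatch", "onnx_nobatch", "plan_nobatch",
    )
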
 
 
-def update_test_count(test_case_count,
-                      failed_test_case_count,
-                      request_count,
-                      test_case_name,
-                      success=True,
-                      count=1):
+def update_test_count(
+    test_case_count,
+    failed_test_case_count,
+    request_count,
+    test_case_name,
+    success=True,
+    count=1,
+):
     if success:
         # Count the times each test case runs
         if test_case_name in test_case_count:
@@ -101,7 +106,6 @@ def update_test_count(test_case_count,
 
 
 class ScenarioSelector:
-
     def __init__(self, probs, rng):
         self.rng_ = rng
         self.probs_range_ = []
@@ -118,20 +122,24 @@ def __init__(self, probs, rng):
             self.probs_range_[i] /= total_weight
 
     def get_scenario(self):
-        return self.scenarios_[bisect.bisect_left(self.probs_range_,
-                                                  self.rng_.rand())]
+        return self.scenarios_[bisect.bisect_left(self.probs_range_, self.rng_.rand())]
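
The selector turns the integer weights into a cumulative, normalized table and maps a uniform draw onto it with bisect. Below is a standalone sketch of the same idea, with dummy scenario labels; the accumulation loop is an assumed reconstruction, since the middle of __init__ is not shown in this diff:

    # Minimal sketch of weighted selection via cumulative probabilities + bisect.
    import bisect
    import numpy as np

    class DummySelector:
        def __init__(self, probs, rng):
            self.rng_ = rng
            self.probs_range_ = []
            self.scenarios_ = []
            total_weight = 0
            for weight, scenario in probs:
                total_weight += weight
                self.probs_range_.append(total_weight)
                self.scenarios_.append(scenario)
            for i in range(len(self.probs_range_)):
                self.probs_range_[i] /= total_weight

        def get_scenario(self):
            return self.scenarios_[bisect.bisect_left(self.probs_range_, self.rng_.rand())]

    rng = np.random.RandomState(0)
    sel = DummySelector([(60, "timeout"), (300, "perf_analyzer")], rng)
    picks = [sel.get_scenario() for _ in range(10000)]
    # "perf_analyzer" should be drawn roughly 300/360 ~ 83% of the time.
    print(picks.count("perf_analyzer") / len(picks))
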
 
 
-def stress_thread(name, seed, correlation_id_base, test_case_count,
-                  failed_test_case_count, sequence_request_count):
+def stress_thread(
+    name,
+    seed,
+    correlation_id_base,
+    test_case_count,
+    failed_test_case_count,
+    sequence_request_count,
+):
     # Thread responsible for generating sequences of inference
     # requests.
     global _thread_exceptions
 
     # Write any thread output to dedicated file
-    with open("{}.log".format(name), 'w') as out_file:
-        print("Starting thread {} with seed {}".format(name, seed),
-              file=out_file)
+    with open("{}.log".format(name), "w") as out_file:
+        print("Starting thread {} with seed {}".format(name, seed), file=out_file)
         rng = np.random.RandomState(seed)
 
         # FIXME revisit to check if it is necessary
@@ -150,74 +158,111 @@ def stress_thread(name, seed, correlation_id_base, test_case_count,
         rare_cnt = 8
         is_last_used_no_end = {}
 
-        update_counter_fn = partial(update_test_count, test_case_count,
-                                    failed_test_case_count,
-                                    sequence_request_count)
+        update_counter_fn = partial(
+            update_test_count,
+            test_case_count,
+            failed_test_case_count,
+            sequence_request_count,
+        )
         for c in range(common_cnt + rare_cnt):
             client_metadata_list.append(
-                (grpcclient.InferenceServerClient("localhost:8001",
-                                                  verbose=FLAGS.verbose),
-                 correlation_id_base + c))
+                (
+                    grpcclient.InferenceServerClient(
+                        "localhost:8001", verbose=FLAGS.verbose
+                    ),
+                    correlation_id_base + c,
+                )
+            )
         pa_start_seq_id = correlation_id_base + common_cnt + rare_cnt
         pa_end_seq_id = correlation_id_base + CORRELATION_ID_BLOCK_SIZE
 
         # Weight roughly in thousandth percent
-        ss = ScenarioSelector([
-            (60,
-             TimeoutScenario(name,
-                             get_trials(False),
-                             verbose=FLAGS.verbose,
-                             out_stream=out_file)),
-            (80, ResNetScenario(
-                name, verbose=FLAGS.verbose, out_stream=out_file)),
-            (60,
-             CrashingScenario(name, verbose=FLAGS.verbose,
-                              out_stream=out_file)),
-            (62,
-             SequenceNoEndScenario(name,
-                                   get_trials(),
-                                   rng,
-                                   is_last_used_no_end,
-                                   verbose=FLAGS.verbose,
-                                   out_stream=out_file)),
-            (68,
-             SequenceValidNoEndScenario(name,
-                                        get_trials(),
-                                        rng,
-                                        is_last_used_no_end,
-                                        verbose=FLAGS.verbose,
-                                        out_stream=out_file)),
-            (68,
-             SequenceValidValidScenario(name,
-                                        get_trials(),
-                                        rng,
-                                        is_last_used_no_end,
-                                        verbose=FLAGS.verbose,
-                                        out_stream=out_file)),
-            (7,
-             SequenceNoStartScenario(name,
-                                     get_trials(),
-                                     rng,
-                                     is_last_used_no_end,
-                                     verbose=FLAGS.verbose,
-                                     out_stream=out_file)),
-            (295,
-             SequenceValidScenario(name,
-                                   get_trials(),
-                                   rng,
-                                   is_last_used_no_end,
-                                   verbose=FLAGS.verbose,
-                                   out_stream=out_file)),
-            (300,
-             PerfAnalyzerScenario(
-                 name,
-                 rng,
-                 get_trials(),
-                 get_trials(False),
-                 sequence_id_range=(pa_start_seq_id, pa_end_seq_id),
-                 verbose=FLAGS.verbose,
-                 out_stream=out_file)),
-        ], rng)
+        ss = ScenarioSelector(
+            [
+                (
+                    60,
+                    TimeoutScenario(
+                        name,
+                        get_trials(False),
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (80, ResNetScenario(name, verbose=FLAGS.verbose, out_stream=out_file)),
+                (
+                    60,
+                    CrashingScenario(name, verbose=FLAGS.verbose, out_stream=out_file),
+                ),
+                (
+                    62,
+                    SequenceNoEndScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    68,
+                    SequenceValidNoEndScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    68,
+                    SequenceValidValidScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    7,
+                    SequenceNoStartScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    295,
+                    SequenceValidScenario(
+                        name,
+                        get_trials(),
+                        rng,
+                        is_last_used_no_end,
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+                (
+                    300,
+                    PerfAnalyzerScenario(
+                        name,
+                        rng,
+                        get_trials(),
+                        get_trials(False),
+                        sequence_id_range=(pa_start_seq_id, pa_end_seq_id),
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+            ],
+            rng,
+        )
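
The weights above ("roughly in thousandth percent") do sum to 1000, so each entry reads directly as a per-mille selection probability, e.g. ~29.5% for SequenceValidScenario and ~30% for PerfAnalyzerScenario; a quick check:

    # Weights as passed to ScenarioSelector above.
    weights = [60, 80, 60, 62, 68, 68, 7, 295, 300]
    assert sum(weights) == 1000
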
 
         rare_idx = 0
         common_idx = 0
@@ -240,8 +285,9 @@ def stress_thread(name, seed, correlation_id_base, test_case_count,
                 update_counter_fn(scenario.scenario_name(), False)
                 _thread_exceptions_mutex.acquire()
                 try:
-                    _thread_exceptions.append((name, scenario.scenario_name(),
-                                               traceback.format_exc()))
+                    _thread_exceptions.append(
+                        (name, scenario.scenario_name(), traceback.format_exc())
+                    )
                 finally:
                     _thread_exceptions_mutex.release()
 
@@ -255,6 +301,72 @@ def stress_thread(name, seed, correlation_id_base, test_case_count,
         print("Exiting thread {}".format(name), file=out_file)
 
 
+def load_thread(
+    name,
+    seed,
+    correlation_id_base,
+    test_case_count,
+    failed_test_case_count,
+    sequence_request_count,
+):
+    # Thread responsible for generating a steady load of inference
+    # requests to keep the compute devices busy.
+    global _thread_exceptions
+
+    # Write any thread output to dedicated file
+    with open("{}.log".format(name), "w") as out_file:
+        print("Starting thread {} with seed {}".format(name, seed), file=out_file)
+        rng = np.random.RandomState(seed)
+
+        update_counter_fn = partial(
+            update_test_count,
+            test_case_count,
+            failed_test_case_count,
+            sequence_request_count,
+        )
+        pa_start_seq_id = correlation_id_base
+        pa_end_seq_id = correlation_id_base + CORRELATION_ID_BLOCK_SIZE
+
+        # Create PerfAnalyzerScenario with no additional trials so that the
+        # default 'resnet' model, which is more compute-intensive than the
+        # simple models, is the only choice for generating load
+        ss = ScenarioSelector(
+            [
+                (
+                    1,
+                    PerfAnalyzerScenario(
+                        name,
+                        rng,
+                        [],
+                        [],
+                        sequence_id_range=(pa_start_seq_id, pa_end_seq_id),
+                        verbose=FLAGS.verbose,
+                        out_stream=out_file,
+                    ),
+                ),
+            ],
+            rng,
+        )
+
+        while not STOP_STRESS_THREAD:
+            scenario = ss.get_scenario()
+            try:
+                res = scenario.run(None)
+                if res is not None:
+                    update_counter_fn(scenario.scenario_name(), count=res)
+            except Exception as ex:
+                update_counter_fn(scenario.scenario_name(), False)
+                _thread_exceptions_mutex.acquire()
+                try:
+                    _thread_exceptions.append(
+                        (name, scenario.scenario_name(), traceback.format_exc())
+                    )
+                finally:
+                    _thread_exceptions_mutex.release()
+
+        print("Exiting thread {}".format(name), file=out_file)
+
+
 def format_content(content, max_line_length):
     # Accumulated line length
     ACC_length = 0
@@ -283,47 +395,45 @@ def accumulate_count(dict_list, test_case_name):
     return count
 
 
-def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
-                    _sequence_request_count):
+def generate_report(
+    elapsed_time, _test_case_count, _failed_test_case_count, _sequence_request_count
+):
     hrs = elapsed_time // 3600
     mins = (elapsed_time / 60) % 60
     secs = elapsed_time % 60
 
     test_case_description = {
-        'SequenceValidScenario':
-            'Send a sequence with "start" and "end" flags.',
-        'SequenceValidValidScenario':
-            'Send two sequences back to back using the same correlation ID'
-            ' with "start" and "end" flags.',
-        'SequenceValidNoEndScenario':
-            'Send two sequences back to back using the same correlation ID.'
-            ' The first with "start" and "end" flags, and the second with no'
-            ' "end" flag.',
-        'SequenceNoStartScenario':
-            'Send a sequence without a "start" flag. Sequence should get an'
-            ' error from the server.',
-        'SequenceNoEndScenario':
-            'Send a sequence with "start" flag but that never ends. The'
-            ' sequence should be aborted by the server and its slot reused'
-            ' for another sequence.',
-        'TimeoutScenario':
-            'Expect an exception for small timeout values.',
-        'ResNetScenario':
-            'Send a request using resnet model.',
-        'CrashingScenario':
-            'Client crashes in the middle of inferences.',
-        'PerfAnalyzerScenario':
-            'Client that maintains a specific load.',
+        "SequenceValidScenario": 'Send a sequence with "start" and "end" flags.',
+        "SequenceValidValidScenario": "Send two sequences back to back using the same correlation ID"
+        ' with "start" and "end" flags.',
+        "SequenceValidNoEndScenario": "Send two sequences back to back using the same correlation ID."
+        ' The first with "start" and "end" flags, and the second with no'
+        ' "end" flag.',
+        "SequenceNoStartScenario": 'Send a sequence without a "start" flag. Sequence should get an'
+        " error from the server.",
+        "SequenceNoEndScenario": 'Send a sequence with "start" flag but that never ends. The'
+        " sequence should be aborted by the server and its slot reused"
+        " for another sequence.",
+        "TimeoutScenario": "Expect an exception for small timeout values.",
+        "ResNetScenario": "Send a request using resnet model.",
+        "CrashingScenario": "Client crashes in the middle of inferences.",
+        "PerfAnalyzerScenario": "Client that maintains a specific load.",
     }
 
     f = open("stress_report.txt", "w")
-    f.write("Test Duration: {:0>2}:{:0>2}:{:0>2} (HH:MM:SS)\n".format(
-        int(hrs), int(mins), int(secs)))
+    f.write(
+        "Test Duration: {:0>2}:{:0>2}:{:0>2} (HH:MM:SS)\n".format(
+            int(hrs), int(mins), int(secs)
+        )
+    )
 
     t = prettytable.PrettyTable(hrules=prettytable.ALL)
     t.field_names = [
-        'Test Case', 'Number of Failures', 'Test Count', 'Request Count',
-        'Test Case Description'
+        "Test Case",
+        "Number of Failures",
+        "Test Count",
+        "Request Count",
+        "Test Case Description",
     ]
 
     t.align["Test Case"] = "l"
@@ -339,33 +449,38 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
     for c in test_case_description:
         # Accumulate all the individual thread counts
         acc_test_case_count[c] = accumulate_count(_test_case_count, c)
-        acc_failed_test_case_count[c] = accumulate_count(
-            _failed_test_case_count, c)
-        acc_sequence_request_count[c] = accumulate_count(
-            _sequence_request_count, c)
+        acc_failed_test_case_count[c] = accumulate_count(_failed_test_case_count, c)
+        acc_sequence_request_count[c] = accumulate_count(_sequence_request_count, c)
 
         description = test_case_description[c]
         # Add additional description on scenarios that allow failure
         if c in ALLOW_FAILURE_SCENARIO:
-            description += " Note that this scenario is marked to allow " \
-                           "failure due to subtle edge cases that will be " \
-                           "investigated in the future. However, only a " \
-                           "minimal failure count is expected and we should " \
-                           "take action if the number is concerning."
-        t.add_row([
-            c, acc_failed_test_case_count[c] if c in acc_failed_test_case_count
-            else 0, acc_test_case_count[c] if c in acc_test_case_count else 0,
-            acc_sequence_request_count[c]
-            if c in acc_sequence_request_count else 0,
-            format_content(description, 50)
-        ])
-
-    t.add_row([
-        'TOTAL',
-        sum(acc_failed_test_case_count.values()),
-        sum(acc_test_case_count.values()),
-        sum(acc_sequence_request_count.values()), 'X'
-    ])
+            description += (
+                " Note that this scenario is marked to allow "
+                "failure due to subtle edge cases that will be "
+                "investigated in the future. However, only a "
+                "minimal failure count is expected and we should "
+                "take action if the number is concerning."
+            )
+        t.add_row(
+            [
+                c,
+                acc_failed_test_case_count[c] if c in acc_failed_test_case_count else 0,
+                acc_test_case_count[c] if c in acc_test_case_count else 0,
+                acc_sequence_request_count[c] if c in acc_sequence_request_count else 0,
+                format_content(description, 50),
+            ]
+        )
+
+    t.add_row(
+        [
+            "TOTAL",
+            sum(acc_failed_test_case_count.values()),
+            sum(acc_test_case_count.values()),
+            sum(acc_sequence_request_count.values()),
+            "X",
+        ]
+    )
 
     print(t)
     f.write(str(t))
@@ -373,33 +488,48 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
     f.close()
 
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     parser = argparse.ArgumentParser()
-    parser.add_argument('-v',
-                        '--verbose',
-                        action="store_true",
-                        required=False,
-                        default=False,
-                        help='Enable verbose output')
-    parser.add_argument('-r',
-                        '--random-seed',
-                        type=int,
-                        required=False,
-                        help='Random seed.')
-    parser.add_argument('-t',
-                        '--concurrency',
-                        type=int,
-                        required=False,
-                        default=8,
-                        help='Request concurrency. Default is 8.')
     parser.add_argument(
-        '-d',
-        '--test-duration',
+        "-v",
+        "--verbose",
+        action="store_true",
+        required=False,
+        default=False,
+        help="Enable verbose output",
+    )
+    parser.add_argument(
+        "-r", "--random-seed", type=int, required=False, help="Random seed."
+    )
+    parser.add_argument(
+        "-t",
+        "--concurrency",
+        type=int,
+        required=False,
+        default=8,
+        help="Request concurrency. Default is 8.",
+    )
+    parser.add_argument(
+        "--load-thread",
+        type=int,
+        required=False,
+        default=0,
+        help="Number of dedicated threads that keep compute "
+        "device (i.e. GPU/CPUs) under load. The load generated "
+        'from "--concurrency" often behaves as request spike, '
+        " this argument may be used to produce consistent load "
+        " to keep devices at high utilization. Default is 0, "
+        "which means no dedicated load thread will be created.",
+    )
+    parser.add_argument(
+        "-d",
+        "--test-duration",
         type=int,
         required=False,
         default=25000,
-        help='Duration of stress test to run. Default is 25000 seconds ' +
-        '(approximately 7 hours).')
+        help="Duration of stress test to run. Default is 25000 seconds "
+        + "(approximately 7 hours).",
+    )
     FLAGS = parser.parse_args()
 
     # Initialize the random seed. For reproducibility each thread
@@ -416,13 +546,17 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
     print("test duration = {}".format(FLAGS.test_duration))
 
     # Create hashes for each thread for generating report
-    _test_case_count = [dict() for x in range(FLAGS.concurrency)]
-    _failed_test_case_count = [dict() for x in range(FLAGS.concurrency)]
-    _sequence_request_count = [dict() for x in range(FLAGS.concurrency)]
+    _test_case_count = [dict() for _ in range(FLAGS.concurrency + FLAGS.load_thread)]
+    _failed_test_case_count = [
+        dict() for _ in range(FLAGS.concurrency + FLAGS.load_thread)
+    ]
+    _sequence_request_count = [
+        dict() for _ in range(FLAGS.concurrency + FLAGS.load_thread)
+    ]
 
     threads = []
 
-    for idx, thd in enumerate(range(FLAGS.concurrency)):
+    for idx in range(FLAGS.concurrency):
         thread_name = "thread_{}".format(idx)
 
         # Create the seed for the thread. Since these are created in
@@ -435,11 +569,46 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
         correlation_id_base = 1 + (idx * CORRELATION_ID_BLOCK_SIZE)
 
         threads.append(
-            threading.Thread(target=stress_thread,
-                             args=(thread_name, seed, correlation_id_base,
-                                   _test_case_count[idx],
-                                   _failed_test_case_count[idx],
-                                   _sequence_request_count[idx])))
+            threading.Thread(
+                target=stress_thread,
+                args=(
+                    thread_name,
+                    seed,
+                    correlation_id_base,
+                    _test_case_count[idx],
+                    _failed_test_case_count[idx],
+                    _sequence_request_count[idx],
+                ),
+            )
+        )
+
+    for idx in range(FLAGS.load_thread):
+        thread_name = "load_thread_{}".format(idx)
+
+        # Create the seed for the thread. Since these are created in
+        # reproducible order off of the initial seed we will get
+        # reproducible results when given the same seed.
+        seed = np.random.randint(2**32)
+
+        # Each thread is reserved a block of correlation IDs or size
+        # CORRELATION_ID_BLOCK_SIZE
+        correlation_id_base = 1 + (
+            (FLAGS.concurrency + idx) * CORRELATION_ID_BLOCK_SIZE
+        )
+
+        threads.append(
+            threading.Thread(
+                target=load_thread,
+                args=(
+                    thread_name,
+                    seed,
+                    correlation_id_base,
+                    _test_case_count[idx],
+                    _failed_test_case_count[idx],
+                    _sequence_request_count[idx],
+                ),
+            )
+        )
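
Each thread, stress or load, is handed a disjoint block of CORRELATION_ID_BLOCK_SIZE correlation IDs; with the default concurrency of 8, the arithmetic works out as below (a small check, not part of the test itself):

    CORRELATION_ID_BLOCK_SIZE = 1024 * 1024
    concurrency = 8  # default of --concurrency

    stress_bases = [1 + i * CORRELATION_ID_BLOCK_SIZE for i in range(concurrency)]
    load_bases = [1 + (concurrency + i) * CORRELATION_ID_BLOCK_SIZE for i in range(2)]
    assert stress_bases[:2] == [1, 1048577]
    assert load_bases[0] == 8388609  # load threads start after all stress-thread blocks
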
 
     exit_code = 0
 
@@ -447,15 +616,13 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
     for t in threads:
         t.start()
 
-    liveness_count = 0
-    while liveness_count < FLAGS.test_duration:
+    while (time.time() - start_time) < FLAGS.test_duration:
         time.sleep(1)
         for t in threads:
             # Stop the test early if there is early termination of a thread.
             if not t.is_alive():
                 exit_code = 1
                 break
-        liveness_count += 1
         if exit_code != 0:
             break
 
@@ -467,15 +634,18 @@ def generate_report(elapsed_time, _test_case_count, _failed_test_case_count,
         if t.is_alive() and (exit_code == 0):
             exit_code = 1
 
-    generate_report(time.time() - start_time, _test_case_count,
-                    _failed_test_case_count, _sequence_request_count)
+    generate_report(
+        time.time() - start_time,
+        _test_case_count,
+        _failed_test_case_count,
+        _sequence_request_count,
+    )
 
     _thread_exceptions_mutex.acquire()
     try:
         if len(_thread_exceptions) > 0:
             for thread, scenario, ex in _thread_exceptions:
-                print("*********\n* {} {}\n{}*********\n".format(
-                    thread, scenario, ex))
+                print("*********\n* {} {}\n{}*********\n".format(thread, scenario, ex))
                 if scenario not in ALLOW_FAILURE_SCENARIO:
                     exit_code = 1
     finally:
diff --git a/qa/L0_long_running_stress/stress_mail.py b/qa/L0_long_running_stress/stress_mail.py
old mode 100644
new mode 100755
index 9f9e1b660e..36f347c2ac
--- a/qa/L0_long_running_stress/stress_mail.py
+++ b/qa/L0_long_running_stress/stress_mail.py
@@ -26,23 +26,37 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import sys
+
 sys.path.append("../common")
 
 import os
-import nightly_email_helper
-
 from datetime import date
 
-CI_JOB_ID = os.environ.get('CI_JOB_ID', '')
+import nightly_email_helper
+
+CI_JOB_ID = os.environ.get("CI_JOB_ID", "")
 
-if __name__ == '__main__':
+if __name__ == "__main__":
     today = date.today().strftime("%Y-%m-%d")
-    subject = "Triton Long-Running Stress Test Summary: " + today
+    subject = (
+        "Triton Long-Running Stress Test "
+        + ((sys.argv[1] + " ") if len(sys.argv) >= 2 else "")
+        + "Summary: "
+        + today
+    )
     stress_report = "stress_report.txt"
     link = "https://gitlab-master.nvidia.com/dl/dgx/tritonserver/-/jobs/" + CI_JOB_ID
     write_up = "

The table below includes results from long-running stress test. Please refer to the description of each test case to see what different kinds of inference requests were sent. Request concurrency is set to 8.

" - write_up += "

Please check the CI output webpage for the details of the failures: " + link + "

" - html_content = "
" + write_up + "
"
+    write_up += (
+        "

Please check the CI output webpage for the details of the failures: " + + link + + "

" + ) + html_content = ( + '
'
+        + write_up
+        + '
'
+    )
     with open(stress_report, "r") as f:
         html_content += f.read() + "\n"
     html_content += "
" diff --git a/qa/L0_long_running_stress/test.sh b/qa/L0_long_running_stress/test.sh index 6e0632809c..b98a89f955 100755 --- a/qa/L0_long_running_stress/test.sh +++ b/qa/L0_long_running_stress/test.sh @@ -47,6 +47,19 @@ DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} SERVER=/opt/tritonserver/bin/tritonserver source ../common/util.sh +# If the test should be run in long and high load setting +if [ "$TRITON_PERF_LONG" == 1 ]; then + # ~ 6.5 days + TEST_DURATION=480000 + LOAD_THREAD_COUNT=2 + EMAIL_SUBJECT="Long" +else + # ~ 7 hours + TEST_DURATION=25000 + LOAD_THREAD_COUNT=0 + EMAIL_SUBJECT="" +fi + RET=0 # If BACKENDS not specified, set to all @@ -57,7 +70,7 @@ export CI_JOB_ID=${CI_JOB_ID} MODEL_DIR=models -rm -fr *.log *.txt *.serverlog models validation_data csv_dir && mkdir models validation_data csv_dir +rm -fr *.log *.txt models validation_data csv_dir && mkdir models validation_data csv_dir # Get the datatype to use based on the backend function get_datatype () { @@ -124,10 +137,8 @@ cp -r $DATADIR/tf_model_store/resnet_v1_50_graphdef $MODEL_DIR/resnet_v1_50_grap sed -i 's/^name: "resnet_v1_50_graphdef"/name: "resnet_v1_50_graphdef_def"/' config.pbtxt && \ echo "optimization { }" >> config.pbtxt) -python -m pip install -U prettytable - SERVER_ARGS="--model-repository=`pwd`/$MODEL_DIR" -SERVER_LOG="./serverlog" +SERVER_LOG="./server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -136,8 +147,9 @@ if [ "$SERVER_PID" == "0" ]; then fi set +e -python $STRESS_TEST >>$CLIENT_LOG 2>&1 +python $STRESS_TEST -d ${TEST_DURATION} --load-thread ${LOAD_THREAD_COUNT} >>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then + cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi @@ -154,8 +166,8 @@ else fi # Run only if both TRITON_FROM and TRITON_TO_DL are set -if [[ ! -z "$TRITON_FROM" ]] || [[ ! -z "$TRITON_TO_DL" ]]; then - python stress_mail.py +if [[ ! -z "$TRITON_FROM" ]] && [[ ! -z "$TRITON_TO_DL" ]]; then + python stress_mail.py "$EMAIL_SUBJECT" fi exit $RET diff --git a/qa/L0_memory/test.sh b/qa/L0_memory/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_memory_growth/busy_op_test.py b/qa/L0_memory_growth/busy_op_test.py old mode 100644 new mode 100755 index 537c328047..2814f38d8c --- a/qa/L0_memory_growth/busy_op_test.py +++ b/qa/L0_memory_growth/busy_op_test.py @@ -27,56 +27,63 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np from builtins import range + +import numpy as np import tritongrpcclient as grpcclient import tritonhttpclient as httpclient from tritonclientutils import np_to_triton_dtype FLAGS = None -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-u', - '--url', - type=str, - required=False, - default='localhost:8000', - help='Inference server URL. Default is localhost:8000.') parser.add_argument( - '-i', - '--protocol', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", + "--url", type=str, required=False, - default='http', - help='Protocol ("http"/"grpc") used to ' + - 'communicate with inference service. 
Default is "http".') - parser.add_argument('-m', - '--model', - type=str, - required=True, - help='Name of model.') - parser.add_argument('-n', - '--num-requests', - type=int, - required=True, - help='Number of asynchronous requests to launch.') - parser.add_argument('-d', - '--delay', - type=int, - required=True, - help='Number of delay cycles to use as input to model.') + default="localhost:8000", + help="Inference server URL. Default is localhost:8000.", + ) + parser.add_argument( + "-i", + "--protocol", + type=str, + required=False, + default="http", + help='Protocol ("http"/"grpc") used to ' + + 'communicate with inference service. Default is "http".', + ) + parser.add_argument("-m", "--model", type=str, required=True, help="Name of model.") + parser.add_argument( + "-n", + "--num-requests", + type=int, + required=True, + help="Number of asynchronous requests to launch.", + ) + parser.add_argument( + "-d", + "--delay", + type=int, + required=True, + help="Number of delay cycles to use as input to model.", + ) FLAGS = parser.parse_args() if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"): - print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format( - FLAGS.protocol)) + print( + 'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol) + ) exit(1) client_util = httpclient if FLAGS.protocol == "http" else grpcclient @@ -94,8 +101,9 @@ input_data = np.array([FLAGS.delay], dtype=np.int32) inputs = [ - client_util.InferInput("in", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + client_util.InferInput( + "in", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) diff --git a/qa/L0_memory_growth/server_memory_mail.py b/qa/L0_memory_growth/server_memory_mail.py old mode 100644 new mode 100755 index 9ad0279df5..d1307d97a6 --- a/qa/L0_memory_growth/server_memory_mail.py +++ b/qa/L0_memory_growth/server_memory_mail.py @@ -26,21 +26,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys -sys.path.append("../common") -import nightly_email_helper +sys.path.append("../common") import glob from datetime import date -if __name__ == '__main__': +import nightly_email_helper + +if __name__ == "__main__": today = date.today().strftime("%Y-%m-%d") subject = "Triton Server Memory Growth " + sys.argv[1] + " Summary: " + today memory_graphs_resnet = glob.glob("memory_growth_resnet*.log") memory_graphs_busyop = glob.glob("memory_growth_busyop.log") write_up = "

This test uses perf_analyzer as clients running on 4 different models. The max allowed difference between mean and maximum memory usage is set to 150MB.

" write_up += "

• What to look for
A linear memory growth in the beginning of the graph is acceptable only when it is followed by a flat memory usage. If a linear memory growth is observed during the entire test then there is possibly a memory leak.

" - html_content = "
" + write_up + "
"
+    html_content = (
+        '        
         
 
'
+        + write_up
+        + '
'
+    )
     for mem_graph in sorted(memory_graphs_resnet):
         html_content += "\n" + mem_graph + "\n"
         with open(mem_graph, "r") as f:
@@ -51,12 +56,18 @@
     # When we see PTX failures in CI, the busyop memory graph is not created.
     if len(memory_graphs_busyop):
         write_up = "

• What to look for
The memory usage should increase continually over time, and a linear growth should be observed in the graph below.

" - html_content += "
" + write_up + "
"
+        html_content += (
+            '
'
+            + write_up
+            + '
'
+        )
         for mem_graph in sorted(memory_graphs_busyop):
             html_content += "\n" + mem_graph + "\n"
             with open(mem_graph, "r") as f:
                 html_content += f.read() + "\n"
     else:
-        html_content += "

The busyop model caused PTX failures when running the CI.

" + html_content += ( + "

The busyop model caused PTX failures when running the CI.

" + ) html_content += "
" nightly_email_helper.send(subject, html_content, is_html=True) diff --git a/qa/L0_memory_growth/test.sh b/qa/L0_memory_growth/test.sh index 4721542ebd..64277e6b6e 100755 --- a/qa/L0_memory_growth/test.sh +++ b/qa/L0_memory_growth/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -76,18 +76,28 @@ CLIENT_BS=8 # Set the number of repetitions in nightly and weekly tests # Set the email subject for nightly and weekly tests if [ "$TRITON_PERF_WEEKLY" == 1 ]; then - # Run the test for each model approximately 1.5 hours - # All tests are run cumulatively for 7 hours - REPETITION=200 - EMAIL_SUBJECT="Weekly" + if [ "$TRITON_PERF_LONG" == 1 ]; then + # ~ 2.5 days for system under test + REPETITION=1400 + EMAIL_SUBJECT="Weekly Long" + else + # Run the test for each model approximately 1.5 hours + # All tests are run cumulatively for 7 hours + REPETITION=200 + EMAIL_SUBJECT="Weekly" + fi else REPETITION=3 EMAIL_SUBJECT="Nightly" fi # Threshold memory growth in MB -MAX_ALLOWED_ALLOC="150" -export MAX_ALLOWED_ALLOC +# NOTES: +# - Bounded memory growth tests typically show < 70 MB usage +# - Plan/ONNX is typically between 20-40 MB +# - Savedmodel is closer to 50-70 MB +# - Unbounded memory growth test typically shows > 100 MB usage +export MAX_ALLOWED_ALLOC="100" # Create local model repository mkdir -p models/ @@ -114,6 +124,12 @@ set -e RET=0 for MODEL in $(ls models); do + # Skip the resnet50_fp32_libtorch model as it is running into `misaligned address' + # Tracked here: https://nvbugs/3954104 + if [ "$MODEL" == "resnet50_fp32_libtorch" ]; then + continue + fi + # Create temporary model repository and copy only the model being tested rm -rf test_repo && mkdir test_repo cp -r models/$MODEL test_repo/ @@ -146,13 +162,25 @@ for MODEL in $(ls models); do set +e + TEMP_CLIENT_LOG=temp_client.log + TEMP_RET=0 + SECONDS=0 # Run the perf analyzer 'REPETITION' times for ((i=1; i<=$REPETITION; i++)); do - $PERF_ANALYZER -v -m $MODEL -i grpc --concurrency-range $CONCURRENCY -b $CLIENT_BS >> $CLIENT_LOG 2>&1 - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** perf_analyzer for $MODEL failed on iteration $i\n***" + # [TMA-621] Use --no-stability mode in perf analyzer when available + $PERF_ANALYZER -v -m $MODEL -i grpc --concurrency-range $CONCURRENCY -b $CLIENT_BS > $TEMP_CLIENT_LOG 2>&1 + PA_RET=$? + # Success + if [ ${PA_RET} -eq 0 ]; then + continue + # Unstable measurement: OK for this test + elif [ ${PA_RET} -eq 2 ]; then + continue + # Other failures unexpected, report error + else + cat $TEMP_CLIENT_LOG >> $CLIENT_LOG + echo -e "\n***\n*** perf_analyzer for $MODEL failed on iteration $i\n***" >> $CLIENT_LOG RET=1 fi done @@ -177,9 +205,11 @@ for MODEL in $(ls models); do python $MASSIF_TEST $MASSIF_LOG $MAX_ALLOWED_ALLOC --start-from-middle >> $CLIENT_LOG 2>&1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG - echo -e "\n***\n*** Test for $MODEL Failed\n***" + echo -e "\n***\n*** Test for $MODEL Failed.\n***" RET=1 fi + # Always output memory usage for easier triage of MAX_ALLOWED_ALLOC settings in the future + grep -i "Change in memory allocation" "${CLIENT_LOG}" || true set -e done @@ -194,7 +224,7 @@ rm -rf test_repo && mkdir test_repo cp -r ${DATADIR}/qa_custom_ops/tf_custom_ops/graphdef_busyop test_repo/ # Explicitly set library path so custom ops can find TF -LD_LIBRARY_PATH=/opt/tritonserver/backends/tensorflow1 +LD_LIBRARY_PATH=/opt/tritonserver/backends/tensorflow:$LD_LIBRARY_PATH SERVER_ARGS="--model-repository=`pwd`/test_repo" SERVER_LD_PRELOAD="${DATADIR}/qa_custom_ops/tf_custom_ops/libbusyop.so" @@ -225,8 +255,9 @@ set +e if [ $SKIP_BUSYOP -ne 1 ]; then SECONDS=0 python $BUSY_OP_TEST -v -m graphdef_busyop -d $DELAY_CYCLES -n $NUM_REQUESTS > $CLIENT_LOG 2>&1 + TEST_RETCODE=$? TEST_DURATION=$SECONDS - if [ $? -ne 0 ]; then + if [ ${TEST_RETCODE} -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test graphdef_busyop Failed\n***" RET=1 @@ -248,11 +279,17 @@ if [ $SKIP_BUSYOP -ne 1 ]; then cat ${GRAPH_LOG} # Check the massif output python $MASSIF_TEST $MASSIF_LOG $MAX_ALLOWED_ALLOC --start-from-middle >> $CLIENT_LOG 2>&1 + # This busyop test is expected to return a non-zero error since it is + # intentionally testing unbounded growth. If it returns success for some + # reason, raise error. if [ $? -ne 1 ]; then cat $CLIENT_LOG - echo -e "\n***\n*** Test for graphdef_busyop Failed\n***" + echo -e "\n***\n*** Massif test for graphdef_busyop Failed\n***" + echo -e "\n***\n*** Expected unbounded growth, but found acceptable growth within ${MAX_ALLOWED_ALLOC} MB\n***" RET=1 fi + # Always output memory usage for easier triage of MAX_ALLOWED_ALLOC settings in the future + grep -i "Change in memory allocation" "${CLIENT_LOG}" || true fi set -e @@ -263,8 +300,8 @@ else fi # Run only if both TRITON_FROM and TRITON_TO_DL are set -if [[ ! -z "$TRITON_FROM" ]] || [[ ! -z "$TRITON_TO_DL" ]]; then - python server_memory_mail.py $EMAIL_SUBJECT +if [[ ! -z "$TRITON_FROM" ]] && [[ ! -z "$TRITON_TO_DL" ]]; then + python server_memory_mail.py "$EMAIL_SUBJECT" fi exit $RET diff --git a/qa/L0_metrics/ensemble_delay/config.pbtxt b/qa/L0_metrics/ensemble_delay/config.pbtxt new file mode 100644 index 0000000000..0eaa2f76f7 --- /dev/null +++ b/qa/L0_metrics/ensemble_delay/config.pbtxt @@ -0,0 +1,67 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +platform: "ensemble" +max_batch_size: 4 + +input [ + { + name: "ENSEMBLE_INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "ENSEMBLE_OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + }, + { + name: "ENSEMBLE_OUTPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +ensemble_scheduling +{ + step [ + { + model_name: "dynamic_composing" + model_version: -1 + input_map { key: "INPUT0", value: "ENSEMBLE_INPUT0" } + output_map { key: "OUTPUT0", value: "ENSEMBLE_OUTPUT0" } + }, + { + model_name: "default_composing" + model_version: -1 + input_map { key: "INPUT0", value: "ENSEMBLE_INPUT0" } + output_map { key: "OUTPUT0", value: "ENSEMBLE_OUTPUT1" } + } + ] +} diff --git a/qa/L0_metrics/identity_delay/config.pbtxt b/qa/L0_metrics/identity_delay/config.pbtxt new file mode 100644 index 0000000000..1062868c2b --- /dev/null +++ b/qa/L0_metrics/identity_delay/config.pbtxt @@ -0,0 +1,58 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +backend: "identity" +max_batch_size: 4 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] + +parameters [ + { + key: "execute_delay_ms" + value: { string_value: "2000" } + } +] diff --git a/qa/L0_metrics/metrics_config_test.py b/qa/L0_metrics/metrics_config_test.py new file mode 100755 index 0000000000..a1324ac28e --- /dev/null +++ b/qa/L0_metrics/metrics_config_test.py @@ -0,0 +1,134 @@ +#!/usr/bin/python +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import os +import sys + +sys.path.append("../common") + +import unittest + +import requests +import test_util as tu + +INF_COUNTER_PATTERNS = [ + "nv_inference_request_duration", + "nv_inference_queue_duration", + "nv_inference_compute_input_duration", + "nv_inference_compute_infer_duration", + "nv_inference_compute_output_duration", +] +INF_SUMMARY_PATTERNS = [ + "nv_inference_request_summary", + "nv_inference_queue_summary", + "nv_inference_compute_input_summary", + "nv_inference_compute_infer_summary", + "nv_inference_compute_output_summary", +] +CACHE_COUNTER_PATTERNS = [ + "nv_cache_num_hits_per_model", + "nv_cache_num_misses_per_model", + "nv_cache_hit_duration_per_model", + "nv_cache_miss_duration_per_model", +] +CACHE_SUMMARY_PATTERNS = ["nv_cache_hit_summary", "nv_cache_miss_summary"] + + +class MetricsConfigTest(tu.TestResultCollector): + def _get_metrics(self): + metrics_url = "http://localhost:8002/metrics" + r = requests.get(metrics_url) + r.raise_for_status() + return r.text + + # Counters + def test_inf_counters_exist(self): + metrics = self._get_metrics() + for metric in INF_COUNTER_PATTERNS: + self.assertIn(metric, metrics) + + def test_inf_counters_missing(self): + metrics = self._get_metrics() + for metric in INF_COUNTER_PATTERNS: + self.assertNotIn(metric, metrics) + + def test_cache_counters_exist(self): + metrics = self._get_metrics() + for metric in CACHE_COUNTER_PATTERNS: + self.assertIn(metric, metrics) + + def test_cache_counters_missing(self): + metrics = self._get_metrics() + for metric in CACHE_COUNTER_PATTERNS: + self.assertNotIn(metric, metrics) + + # Summaries + def test_inf_summaries_exist(self): + metrics = self._get_metrics() + for metric in INF_SUMMARY_PATTERNS: + self.assertIn(metric, metrics) + + def test_inf_summaries_missing(self): + metrics = self._get_metrics() + for metric in INF_SUMMARY_PATTERNS: + self.assertNotIn(metric, metrics) + + def test_cache_summaries_exist(self): + metrics = self._get_metrics() + for metric in CACHE_SUMMARY_PATTERNS: + self.assertIn(metric, metrics) + + def test_cache_summaries_missing(self): + metrics = self._get_metrics() + for metric in CACHE_SUMMARY_PATTERNS: + self.assertNotIn(metric, metrics) + + def test_summaries_custom_quantiles(self): + metrics = self._get_metrics() + # This env var should be set by test.sh or caller + quantile_pairs = os.environ.get("SUMMARY_QUANTILES", None) + self.assertIsNotNone(quantile_pairs) + + quantiles = [pair.split(":")[0] for pair in quantile_pairs.split(",")] + print(metrics) + for quantile in quantiles: + print(quantile) + self.assertIn(f'quantile="{quantile}"', metrics) + + # DLIS-4762: Disable request summary when caching enabled for now + def test_inf_summaries_exist_with_cache(self): + metrics = self._get_metrics() + bad_patterns = ["nv_inference_request_summary"] + ok_patterns = list(set(INF_SUMMARY_PATTERNS) - set(bad_patterns)) + for metric in ok_patterns: + self.assertIn(metric, metrics) + for metric in bad_patterns: + self.assertNotIn(metric, metrics) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_metrics/metrics_queue_size_test.py b/qa/L0_metrics/metrics_queue_size_test.py new file mode 100755 index 0000000000..0554274109 --- /dev/null +++ b/qa/L0_metrics/metrics_queue_size_test.py @@ -0,0 +1,306 @@ +#!/usr/bin/python +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import math +import time +import unittest +from functools import partial + +import numpy as np +import requests +import test_util as tu +import tritonclient.http +from tritonclient.utils import triton_to_np_dtype + +QUEUE_METRIC_TEMPLATE = ( + 'nv_inference_pending_request_count{{model="{model_name}",version="1"}}' +) +INFER_METRIC_TEMPLATE = 'nv_inference_count{{model="{model_name}",version="1"}}' +EXEC_METRIC_TEMPLATE = 'nv_inference_exec_count{{model="{model_name}",version="1"}}' + + +class MetricsPendingRequestCountTest(tu.TestResultCollector): + def setUp(self): + self.metrics = None + self.metrics_url = "http://localhost:8002/metrics" + self.server_url = "localhost:8000" + + # Used to verify model config is set to expected values + self.max_batch_size = 4 + self.delay_ms = 2000 + self.delay_sec = self.delay_ms // 1000 + + # Setup dummy inputs + dtype = "FP32" + shape = (1, 1) + input_np = np.ones(shape, dtype=triton_to_np_dtype(dtype)) + self.inputs = [ + tritonclient.http.InferInput("INPUT0", shape, dtype).set_data_from_numpy( + input_np + ) + ] + self.ensemble_inputs = [ + tritonclient.http.InferInput( + "ENSEMBLE_INPUT0", shape, dtype + ).set_data_from_numpy(input_np) + ] + + # Verify values for filling request queues + self.num_requests = 10 + self.concurrency = 10 + # Concurrency must be at least as high as number of async requests we intend + # to send N requests to fill request queues before blocking on any results. 
+ self.assertGreaterEqual(self.concurrency, self.num_requests) + self.client = tritonclient.http.InferenceServerClient( + url=self.server_url, concurrency=self.concurrency + ) + + # Test specific configurations + self.max_queue_size = 0 + + def _validate_model_config(self, model_name, max_queue_size=0): + config = self.client.get_model_config(model_name) + print(config) + params = config.get("parameters", {}) + delay_ms = int(params.get("execute_delay_ms", {}).get("string_value")) + max_batch_size = config.get("max_batch_size") + self.assertEqual(delay_ms, self.delay_ms) + self.assertEqual(max_batch_size, self.max_batch_size) + + dynamic_batching = config.get("dynamic_batching", {}) + default_queue_policy = dynamic_batching.get("default_queue_policy", {}) + self.max_queue_size = default_queue_policy.get("max_queue_size", 0) + + self.assertEqual(self.max_queue_size, max_queue_size) + + return config + + def _get_metrics(self): + r = requests.get(self.metrics_url) + r.raise_for_status() + return r.text + + def _get_metric_line(self, metric, metrics): + for line in metrics.splitlines(): + if metric in line: + return line + return None + + def _get_metric_value(self, metric): + metrics = self._get_metrics() + self.assertIn(metric, metrics) + line = self._get_metric_line(metric, metrics) + print(line) + if not line: + return None + value = line.split()[1] + return float(value) + + def _assert_metric_equals(self, metric, expected_value): + value = self._get_metric_value(metric) + self.assertEqual(value, expected_value) + + def _assert_metric_greater_than(self, metric, gt_value): + value = self._get_metric_value(metric) + self.assertGreater(value, gt_value) + + def _send_async_requests(self, model_name, inputs, futures): + for _ in range(self.num_requests): + futures.append(self.client.async_infer(model_name, inputs)) + + def _send_async_requests_sequence(self, num_seq_slots, model_name, inputs, futures): + started_seqs = {} + num_sent = 0 + while num_sent < self.num_requests: + # Add requests to each sequence slot round-robin, seq_id must be > 0 + # We don't care about finishing any sequences, just need to queue up + # requests for each sequence until num_requests is hit. + seq_id = (num_sent % num_seq_slots) + 1 + # Toggle start flag to False after first request per sequence ID + start = True if seq_id not in started_seqs else False + started_seqs[seq_id] = True + futures.append( + self.client.async_infer( + model_name, + inputs, + request_id=str(num_sent), + sequence_id=seq_id, + sequence_start=start, + ) + ) + num_sent += 1 + + def _test_helper( + self, model_name, batch_size, send_requests_func, max_queue_size=0 + ): + self._validate_model_config(model_name, max_queue_size=max_queue_size) + + queue_size = QUEUE_METRIC_TEMPLATE.format(model_name=model_name) + infer_count = INFER_METRIC_TEMPLATE.format(model_name=model_name) + exec_count = EXEC_METRIC_TEMPLATE.format(model_name=model_name) + # Metric should be zero before sending any requests + self._assert_metric_equals(queue_size, 0) + # Send N requests, letting scheduler delay queue fill up when applicable + futures = [] + send_requests_func(model_name, self.inputs, futures) + # Give Triton a second to load all requests into queues + time.sleep(1) + + # Start from (num_requests-batch_size) because 1 batch should be executing, + # and the rest of the requests should be queued. + # If max_queue_size is specified then the queued requests would be capped + # at max_queue_size. 
+ if max_queue_size != 0: + self._assert_metric_equals(queue_size, max_queue_size) + starting_queue_size = max_queue_size + else: + starting_queue_size = self.num_requests - batch_size + + for expected_queue_size in range(starting_queue_size, 0, -1 * batch_size): + self._assert_metric_equals(queue_size, expected_queue_size) + time.sleep(self.delay_sec) + # Queue should be empty now + self._assert_metric_equals(queue_size, 0) + # Let final batch finish + time.sleep(self.delay_sec) + + # All requests should've been executed without any batching + expected_infer_count = starting_queue_size + batch_size + self._assert_metric_equals(infer_count, expected_infer_count) + expected_exec_count = math.ceil(expected_infer_count / batch_size) + self._assert_metric_equals(exec_count, expected_exec_count) + + failed_count = 0 + for future in futures: + try: + future.get_result() + except Exception as e: + failed_count = failed_count + 1 + + self.assertEqual( + failed_count, self.num_requests - batch_size - starting_queue_size + ) + + def test_default_scheduler(self): + model_name = "default" + # Default scheduler won't do any batching + batch_size = 1 + self._test_helper(model_name, batch_size, self._send_async_requests) + + def test_dynamic_batch_scheduler(self): + model_name = "dynamic" + # With sufficient queue delay set, we expect full batches to be executed + batch_size = self.max_batch_size + self._test_helper(model_name, batch_size, self._send_async_requests) + + def test_fail_max_queue_size(self): + model_name = "max_queue_size" + # This test checks whether metrics are properly accounts for requests + # that fail to enqueue on the server. The test sets the max_queue_size + # and any additional requests beyond the specified queue size should fail + # instead of waiting for execution. + batch_size = self.max_batch_size + self._test_helper( + model_name, batch_size, self._send_async_requests, max_queue_size=4 + ) + + def test_sequence_batch_scheduler_direct(self): + model_name = "sequence_direct" + # With sufficient queue delay and minimum_slot_utilization set, we + # expect full batches to be executed. 
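[Editor's aside, not part of the patch] The queue arithmetic that _test_helper's comments describe works out as follows for the values this test configures; a minimal sketch assuming num_requests=10 and max_batch_size=4:

# Sketch only: expected queue drain for _test_helper with num_requests=10,
# max_batch_size=4 and the 2000 ms execute delay from identity_delay/config.pbtxt.
import math

num_requests, batch_size = 10, 4
starting_queue_size = num_requests - batch_size                      # 6 queued, one batch of 4 executing
drain_steps = list(range(starting_queue_size, 0, -batch_size))       # queue observed at 6, then 2, then 0
expected_infer_count = starting_queue_size + batch_size              # 10 requests inferred in total
expected_exec_count = math.ceil(expected_infer_count / batch_size)   # 3 model executions

assert drain_steps == [6, 2]
assert expected_exec_count == 3
# With max_queue_size=4 (the "max_queue_size" model), the queue is capped at 4,
# so 10 - 4 - 4 = 2 requests are rejected, which is what test_fail_max_queue_size
# counts via failed futures.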
+ batch_size = self.max_batch_size + num_seq_slots = batch_size + send_requests_func = partial(self._send_async_requests_sequence, num_seq_slots) + self._test_helper(model_name, batch_size, send_requests_func) + + def test_sequence_batch_scheduler_oldest(self): + model_name = "sequence_oldest" + # With sufficient queue delay set, we expect full batches to be executed + batch_size = self.max_batch_size + num_seq_slots = batch_size + send_requests_func = partial(self._send_async_requests_sequence, num_seq_slots) + self._test_helper(model_name, batch_size, send_requests_func) + + def test_ensemble_scheduler(self): + ensemble_model_name = "ensemble" + composing_model_names = ["dynamic_composing", "default_composing"] + ensemble_queue_size = QUEUE_METRIC_TEMPLATE.format( + model_name=ensemble_model_name + ) + composing_queue_sizes = [ + QUEUE_METRIC_TEMPLATE.format(model_name=name) + for name in composing_model_names + ] + ensemble_infer_count = INFER_METRIC_TEMPLATE.format( + model_name=ensemble_model_name + ) + composing_infer_counts = [ + INFER_METRIC_TEMPLATE.format(model_name=name) + for name in composing_model_names + ] + + # Metric should be zero before sending any requests + self._assert_metric_equals(ensemble_queue_size, 0) + for queue_size in composing_queue_sizes: + self._assert_metric_equals(queue_size, 0) + # Send some ensemble requests + futures = [] + self._send_async_requests(ensemble_model_name, self.ensemble_inputs, futures) + # Give Triton time to pass some requests to composing models. This test + # is less comprehensive on checking exact queue values, and just verifies + # each composing queue gets filled and ensemble's queue is empty. + time.sleep(1) + + # Top-level ensemble size should still be zero, as all pending requests should + # be scheduled and reflected in composing models, and not considered "pending" at ensemble level. + self._assert_metric_equals(ensemble_queue_size, 0) + # Composing models should be non-zero + for queue_size in composing_queue_sizes: + self._assert_metric_greater_than(queue_size, 0) + + # Verify no inference exceptions were raised and let composing models + # finish their requests + for future in futures: + future.get_result() + + # Check that all queues are empty after getting results + self._assert_metric_equals(ensemble_queue_size, 0) + for queue_size in composing_queue_sizes: + self._assert_metric_equals(queue_size, 0) + + # Sanity check infer counts on ensemble and composing models + self._assert_metric_equals(ensemble_infer_count, self.num_requests) + for infer_count in composing_infer_counts: + self._assert_metric_equals(infer_count, self.num_requests) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_metrics/test.sh b/qa/L0_metrics/test.sh index 46059ef96a..dea1c62041 100755 --- a/qa/L0_metrics/test.sh +++ b/qa/L0_metrics/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,14 +42,49 @@ MODELDIR=`pwd`/models DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver -SERVER_ARGS="--model-repository=${MODELDIR}" +BASE_SERVER_ARGS="--model-repository=${MODELDIR}" +SERVER_ARGS="${BASE_SERVER_ARGS}" SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -f $SERVER_LOG +CLIENT_LOG="client.log" +TEST_RESULT_FILE="test_results.txt" +function check_unit_test() { + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + else + EXPECTED_NUM_TESTS="${1:-1}" + check_test_results ${TEST_RESULT_FILE} ${EXPECTED_NUM_TESTS} + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi +} + +function run_and_check_server() { + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi +} +rm -f $SERVER_LOG RET=0 +if [ `ps | grep -c "tritonserver"` != "0" ]; then + echo -e "Tritonserver already running" + echo -e `ps | grep tritonserver` + exit 1 +fi + +### UNIT TESTS + TEST_LOG="./metrics_api_test.log" UNIT_TEST=./metrics_api_test @@ -65,6 +100,8 @@ if [ $? -ne 0 ]; then fi set -e +### GPU Metrics + # Prepare a libtorch float32 model with basic config rm -rf $MODELDIR model=libtorch_float32_float32_float32 @@ -77,12 +114,7 @@ mkdir -p $MODELDIR/${model}/1 && \ set +e export CUDA_VISIBLE_DEVICES=0,1,2 -run_server -if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" - cat $SERVER_LOG - exit 1 -fi +run_and_check_server num_gpus=`curl -s localhost:8002/metrics | grep "nv_gpu_utilization{" | wc -l` if [ $num_gpus -ne 3 ]; then @@ -95,12 +127,7 @@ kill $SERVER_PID wait $SERVER_PID export CUDA_VISIBLE_DEVICES=0 -run_server -if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" - cat $SERVER_LOG - exit 1 -fi +run_and_check_server num_gpus=`curl -s localhost:8002/metrics | grep "nv_gpu_utilization{" | wc -l` if [ $num_gpus -ne 1 ]; then @@ -118,13 +145,8 @@ METRICS_INTERVAL_MS=500 # the update is not ready for unexpected reason WAIT_INTERVAL_SECS=0.6 -SERVER_ARGS="$SERVER_ARGS --metrics-interval-ms=${METRICS_INTERVAL_MS}" -run_server -if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start $SERVER\n***" - cat $SERVER_LOG - exit 1 -fi +SERVER_ARGS="$BASE_SERVER_ARGS --metrics-interval-ms=${METRICS_INTERVAL_MS}" +run_and_check_server num_iterations=10 @@ -155,8 +177,182 @@ for (( i = 0; i < $num_iterations; ++i )); do prev_energy=$current_energy done +### CPU / RAM Metrics + +# The underlying values for these metrics do not always update frequently, +# so give ample WAIT time to make sure they change and are being updated. 
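[Editor's aside, not part of the patch] A rough Python rendition of the CPU-metric polling loop that the shell implements just below, for readability only: sample each metric on an interval and tolerate a small number of iterations where the value did not change between samples.

import time

import requests

METRICS_URL = "http://localhost:8002/metrics"   # default metrics endpoint used throughout this test
CPU_METRICS = ["nv_cpu_utilization", "nv_cpu_memory_used_bytes"]

def sample(metric):
    # Prometheus text format: "<name>[{labels}] <value>"; HELP/TYPE lines start with '#'
    for line in requests.get(METRICS_URL).text.splitlines():
        if line.startswith(metric):
            return float(line.split()[-1])
    return None

for metric in CPU_METRICS:
    prev, num_not_updated = sample(metric), 0
    for _ in range(10):          # num_iterations in the shell loop
        time.sleep(2.0)          # WAIT_INTERVAL_SECS
        current = sample(metric)
        if current == prev:
            num_not_updated += 1
        prev = current
    # Mirrors the shell's tolerance threshold (num_not_updated_threshold=3)
    assert num_not_updated <= 3, f"{metric} rarely updated"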
+CPU_METRICS="nv_cpu_utilization nv_cpu_memory_used_bytes" +WAIT_INTERVAL_SECS=2.0 +for metric in ${CPU_METRICS}; do + echo -e "\n=== Checking Metric: ${metric} ===\n" + prev_value=`curl -s localhost:8002/metrics | grep ${metric} | grep -v "HELP\|TYPE" | awk '{print $2}'` + + num_not_updated=0 + num_not_updated_threshold=3 + for (( i = 0; i < $num_iterations; ++i )); do + sleep $WAIT_INTERVAL_SECS + current_value=`curl -s localhost:8002/metrics | grep ${metric} | grep -v "HELP\|TYPE" | awk '{print $2}'` + if [ $current_value == $prev_value ]; then + num_not_updated=$((num_not_updated+1)) + fi + prev_value=$current_value + done + + # Give CPU metrics some tolerance to not update, up to a threshold + # DLIS-4304: An alternative may be to run some busy work on CPU in the + # background rather than allowing a tolerance threshold + if [[ ${num_not_updated} -gt ${num_not_updated_threshold} ]]; then + cat $SERVER_LOG + echo "Metrics were not updated ${num_not_updated}/${num_iterations} times for interval of ${METRICS_INTERVAL_MS} milliseconds for metric: ${metric}" + echo -e "\n***\n*** Metric Interval test failed. \n***" + RET=1 + break + fi +done + +# Verify reported total memory is non-zero +total_memory=`curl -s localhost:8002/metrics | grep "nv_cpu_memory_total_bytes" | grep -v "HELP\|TYPE" | awk '{print $2}'` +test -z "${total_memory}" && total_memory=0 +if [ ${total_memory} -eq 0 ]; then + echo "Found nv_cpu_memory_total_bytes had a value of zero, this should not happen." + echo -e "\n***\n*** CPU total memory test failed. \n***" + RET=1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +### Metric Config CLI and different Metric Types ### +MODELDIR="${PWD}/unit_test_models" +mkdir -p "${MODELDIR}/identity_cache_on/1" +mkdir -p "${MODELDIR}/identity_cache_off/1" +BASE_SERVER_ARGS="--model-repository=${MODELDIR} --model-control-mode=explicit" +PYTHON_TEST="metrics_config_test.py" + +# Check default settings: Counters should be enabled, summaries should be disabled +SERVER_ARGS="${BASE_SERVER_ARGS} --load-model=identity_cache_off" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_counters_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_summaries_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_counters_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_summaries_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +# Enable summaries, counters still enabled by default +SERVER_ARGS="${BASE_SERVER_ARGS} --load-model=identity_cache_off --metrics-config summary_latencies=true" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_counters_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_summaries_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_counters_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_summaries_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +# Enable summaries, disable counters +SERVER_ARGS="${BASE_SERVER_ARGS} --load-model=identity_cache_off --metrics-config summary_latencies=true --metrics-config counter_latencies=false" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_counters_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 
${PYTHON_TEST} MetricsConfigTest.test_inf_summaries_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_counters_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_summaries_missing 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +# Enable summaries and counters, check cache metrics +CACHE_ARGS="--cache-config local,size=1048576" +SERVER_ARGS="${BASE_SERVER_ARGS} ${CACHE_ARGS} --load-model=identity_cache_on --metrics-config summary_latencies=true --metrics-config counter_latencies=true" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_counters_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +# DLIS-4762: Asserts that request summary is not published when cache is +# enabled for a model, until this if fixed. +python3 ${PYTHON_TEST} MetricsConfigTest.test_inf_summaries_exist_with_cache 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_counters_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +python3 ${PYTHON_TEST} MetricsConfigTest.test_cache_summaries_exist 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +# Check setting custom summary quantiles +export SUMMARY_QUANTILES="0.1:0.0.1,0.7:0.01,0.75:0.01" +SERVER_ARGS="${BASE_SERVER_ARGS} --load-model=identity_cache_off --metrics-config summary_latencies=true --metrics-config summary_quantiles=${SUMMARY_QUANTILES}" +run_and_check_server +python3 ${PYTHON_TEST} MetricsConfigTest.test_summaries_custom_quantiles 2>&1 | tee ${CLIENT_LOG} +check_unit_test +kill $SERVER_PID +wait $SERVER_PID + +### Pending Request Count (Queue Size) Metric Behavioral Tests ### +MODELDIR="${PWD}/queue_size_models" +SERVER_ARGS="--model-repository=${MODELDIR} --log-verbose=1" +PYTHON_TEST="metrics_queue_size_test.py" +rm -rf "${MODELDIR}" +mkdir -p "${MODELDIR}" + +# Re-use an identity model that sleeps during execution for N seconds for the +# batch of requests. Then we can confirm queue size behaviors for various +# scheduling/batching strategies. +BASE_MODEL="identity_delay" +# Don't use special debug env var for this, just set sufficient parameters for +# each scheduler to let them fill batches when possible. 
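[Editor's aside, not part of the patch] The section below builds one model per scheduler type under queue_size_models/; as a quick usage sketch, the pending-request gauge that metrics_queue_size_test.py asserts on can be read for any of those models like this (names such as "dynamic" refer to the models created below):

import requests

def pending_request_count(model, version="1"):
    # Matches the QUEUE_METRIC_TEMPLATE used in metrics_queue_size_test.py
    needle = f'nv_inference_pending_request_count{{model="{model}",version="{version}"}}'
    for line in requests.get("http://localhost:8002/metrics").text.splitlines():
        if line.startswith(needle):
            return float(line.split()[-1])
    return None

# e.g. while requests are queued against the "dynamic" model
print(pending_request_count("dynamic"))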
+unset TRITONSERVER_DELAY_SCHEDULER +export MAX_BATCH_SIZE=4 +# Delay up to 100ms to form batches up to MAX_BATCH_SIZE +export MAX_QUEUE_DELAY_US=100000 + +# Create a model per scheduler type +DEFAULT_MODEL="${MODELDIR}/default" +cp -r "${BASE_MODEL}" "${DEFAULT_MODEL}" +mkdir -p "${DEFAULT_MODEL}/1" +sed -i "s/^max_batch_size.*/max_batch_size: ${MAX_BATCH_SIZE}/" "${DEFAULT_MODEL}/config.pbtxt" + +DYNAMIC_MODEL="${MODELDIR}/dynamic" +cp -r "${DEFAULT_MODEL}" "${DYNAMIC_MODEL}" +echo -e "\ndynamic_batching { max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_US} }\n" >> "${DYNAMIC_MODEL}/config.pbtxt" + +MAX_QUEUE_SIZE_MODEL="${MODELDIR}/max_queue_size" +cp -r "${DEFAULT_MODEL}" "${MAX_QUEUE_SIZE_MODEL}" +echo -e "\ndynamic_batching { max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_US} default_queue_policy { max_queue_size: 4 } }\n" >> "${MAX_QUEUE_SIZE_MODEL}/config.pbtxt" + +SEQUENCE_DIRECT_MODEL="${MODELDIR}/sequence_direct" +cp -r "${DEFAULT_MODEL}" "${SEQUENCE_DIRECT_MODEL}" +echo -e "\nsequence_batching { direct { max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_US}, minimum_slot_utilization: 1.0 } }\n" >> "${SEQUENCE_DIRECT_MODEL}/config.pbtxt" + +SEQUENCE_OLDEST_MODEL="${MODELDIR}/sequence_oldest" +cp -r "${DEFAULT_MODEL}" "${SEQUENCE_OLDEST_MODEL}" +echo -e "\nsequence_batching { oldest { max_queue_delay_microseconds: ${MAX_QUEUE_DELAY_US}, max_candidate_sequences: ${MAX_BATCH_SIZE} } }\n" >> "${SEQUENCE_OLDEST_MODEL}/config.pbtxt" + +BASE_ENSEMBLE="ensemble_delay" +ENSEMBLE_MODEL="${MODELDIR}/ensemble" +cp -r "${BASE_ENSEMBLE}" "${ENSEMBLE_MODEL}" +mkdir -p "${ENSEMBLE_MODEL}/1" +# Use uniquely named composing models to avoid clashing +# metric values with individual and ensemble tests. +cp -r "${DEFAULT_MODEL}" "${MODELDIR}/default_composing" +cp -r "${DYNAMIC_MODEL}" "${MODELDIR}/dynamic_composing" + + +run_and_check_server +python3 ${PYTHON_TEST} 2>&1 | tee ${CLIENT_LOG} kill $SERVER_PID wait $SERVER_PID +expected_tests=6 +check_unit_test "${expected_tests}" if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" diff --git a/qa/L0_metrics/unit_test_models/identity_cache_off/config.pbtxt b/qa/L0_metrics/unit_test_models/identity_cache_off/config.pbtxt new file mode 100644 index 0000000000..863c35df07 --- /dev/null +++ b/qa/L0_metrics/unit_test_models/identity_cache_off/config.pbtxt @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +response_cache { + enable: false +} diff --git a/qa/L0_metrics/unit_test_models/identity_cache_on/config.pbtxt b/qa/L0_metrics/unit_test_models/identity_cache_on/config.pbtxt new file mode 100644 index 0000000000..4bf5a7ef3b --- /dev/null +++ b/qa/L0_metrics/unit_test_models/identity_cache_on/config.pbtxt @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +response_cache { + enable: true +} diff --git a/qa/L0_mlflow/plugin_test.py b/qa/L0_mlflow/plugin_test.py old mode 100644 new mode 100755 index 8dbf9d9146..a5d87a3c19 --- a/qa/L0_mlflow/plugin_test.py +++ b/qa/L0_mlflow/plugin_test.py @@ -27,52 +27,52 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") -import sys +import json import unittest + +import numpy as np import test_util as tu from mlflow.deployments import get_deploy_client -import json -import numpy as np class PluginTest(tu.TestResultCollector): - def setUp(self): - self.client_ = get_deploy_client('triton') + self.client_ = get_deploy_client("triton") def _validate_deployment(self, model_name): # create - self.client_.create_deployment(model_name, - "models:/{}/1".format(model_name), - flavor="onnx") + self.client_.create_deployment( + model_name, "models:/{}/1".format(model_name), flavor="onnx" + ) # list deployment_list = self.client_.list_deployments() self.assertEqual(len(deployment_list), 1) - self.assertEqual(deployment_list[0]['name'], model_name) + self.assertEqual(deployment_list[0]["name"], model_name) # get deployment = self.client_.get_deployment(model_name) - self.assertEqual(deployment['name'], model_name) + self.assertEqual(deployment["name"], model_name) # predict inputs = {} with open("./mlflow-triton-plugin/examples/input.json", "r") as f: input_json = json.load(f) - for key, value in input_json['inputs'].items(): + for key, value in input_json["inputs"].items(): inputs[key] = np.array(value, dtype=np.float32) output = self.client_.predict(model_name, inputs) - with open("./mlflow-triton-plugin/examples/expected_output.json", - "r") as f: + with open("./mlflow-triton-plugin/examples/expected_output.json", "r") as f: output_json = json.load(f) - for key, value in output_json['outputs'].items(): + for key, value in output_json["outputs"].items(): np.testing.assert_allclose( - output['outputs'][key], + output["outputs"][key], np.array(value, dtype=np.int32), - err_msg='Inference result is not correct') + err_msg="Inference result is not correct", + ) # delete self.client_.delete_deployment(model_name) @@ -81,13 +81,12 @@ def test_onnx_flavor(self): # Log the ONNX model to MLFlow import mlflow.onnx import onnx + model = onnx.load( "./mlflow-triton-plugin/examples/onnx_float32_int32_int32/1/model.onnx" ) # Use a different name to ensure the plugin operates on correct model - mlflow.onnx.log_model(model, - "triton", - registered_model_name="onnx_model") + mlflow.onnx.log_model(model, "triton", registered_model_name="onnx_model") self._validate_deployment("onnx_model") @@ -95,24 +94,28 @@ def test_onnx_flavor_with_files(self): # Log the ONNX model and additional Triton config file to MLFlow import mlflow.onnx import onnx + model = onnx.load( "./mlflow-triton-plugin/examples/onnx_float32_int32_int32/1/model.onnx" ) - config_path = "./mlflow-triton-plugin/examples/onnx_float32_int32_int32/config.pbtxt" + config_path = ( + "./mlflow-triton-plugin/examples/onnx_float32_int32_int32/config.pbtxt" + ) # Use a different name to ensure the plugin operates on correct model - mlflow.onnx.log_model(model, - "triton", - registered_model_name="onnx_model_with_files") + mlflow.onnx.log_model( + model, "triton", registered_model_name="onnx_model_with_files" + ) mlflow.log_artifact(config_path, "triton") self._validate_deployment("onnx_model_with_files") # Check if the additional files are properly copied import filecmp + self.assertTrue( - filecmp.cmp(config_path, - "./models/onnx_model_with_files/config.pbtxt")) + filecmp.cmp(config_path, "./models/onnx_model_with_files/config.pbtxt") + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_mlflow/test.sh b/qa/L0_mlflow/test.sh old mode 100644 new mode 100755 index 74c9348f1d..4b5205ba25 --- 
a/qa/L0_mlflow/test.sh +++ b/qa/L0_mlflow/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -31,10 +31,28 @@ source ../common/util.sh rm -fr *.log *.json +# The default version of python 3.10.6 included in +# Ubuntu 22.04 installs blinker 1.4. This doesn't +# work with the awscli which we try to install. +# Uninstalling blinker and allowing pip to install blinker 1.6 +# fixes this issue. The alternative to this is to +# install a higher version of python which uses blinker 1.6, +# but it is unknown whether this test should rely on +# the default installation of python. +apt remove -y python3-blinker + RET=0 # Set up MLflow and dependencies used by the test -pip install mlflow onnx onnxruntime +pip install mlflow onnx onnxruntime boto3 + +# Install AWS CLI +if ! command -v aws --version &> /dev/null; then + curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "awscliv2.zip" + unzip awscliv2.zip + ./aws/install + rm -r ./aws/ ./awscliv2.zip +fi # Set environment variables for MLFlow and Triton plugin export MLFLOW_MODEL_REPO=./mlflow/artifacts @@ -49,14 +67,18 @@ pip install ./mlflow-triton-plugin/ python - << EOF from mlflow.tracking import MlflowClient c = MlflowClient() -for m in c.list_registered_models(): +for m in c.search_registered_models(): c.delete_registered_model(m.name) EOF rm -rf ./models mkdir -p ./models +# Put some models in model repository to make sure MLFlow plugin would ignore +# model that is not registered via MLFlow +cp -r ./mlflow-triton-plugin/examples/onnx_float32_int32_int32 ./models/existing_model + SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=./models --strict-model-config=false --model-control-mode=explicit" +SERVER_ARGS="--model-repository=./models --strict-model-config=false --model-control-mode=explicit --load-model=*" SERVER_LOG="./inference_server.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -94,6 +116,10 @@ if [ $CLI_RET -eq 0 ]; then echo -e "\n***\n*** Expect deployed 'triton' flavor model to be listed\n***" CLI_RET=1 fi + if [ `grep -c "existing_model.*READY" $CLI_LOG` != "0" ]; then + echo -e "\n***\n*** Unexpected non-MLflow model listed\n***" + CLI_RET=1 + fi fi if [ $CLI_RET -eq 0 ]; then mlflow deployments get -t triton --name onnx_float32_int32_int32 >>$CLI_LOG 2>&1 @@ -152,6 +178,7 @@ PY_TEST=plugin_test.py TEST_RESULT_FILE='test_results.txt' python $PY_TEST >>$PY_LOG 2>&1 if [ $? -ne 0 ]; then + cat $SERVER_LOG cat $PY_LOG echo -e "\n***\n*** Python Test Failed\n***" RET=1 @@ -166,6 +193,80 @@ fi set -e kill_server + + +# +# Test S3, the setup is duplicated from L0_storage_S3, except the bucket is +# created empty +# + +# Clear mlflow registered models if any +python - << EOF +from mlflow.tracking import MlflowClient +c = MlflowClient() +for m in c.search_registered_models(): + c.delete_registered_model(m.name) +EOF + +# S3 credentials are necessary for this test. 
Pass via ENV variables +aws configure set default.region $AWS_DEFAULT_REGION && \ + aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID && \ + aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY + +# S3 bucket path (Point to bucket when testing cloud storage) +BUCKET_URL="s3://triton-bucket-${CI_JOB_ID}" + +# Cleanup and delete S3 test bucket if it already exists (due to test failure) +aws s3 rm $BUCKET_URL --recursive --include "*" && \ + aws s3 rb $BUCKET_URL || true + +# Make S3 test bucket +aws s3 mb "${BUCKET_URL}" + +# Remove Slash in BUCKET_URL +BUCKET_URL=${BUCKET_URL%/} +BUCKET_URL_SLASH="${BUCKET_URL}/" + +export TRITON_MODEL_REPO=${BUCKET_URL} +SERVER_ARGS="--model-repository=${TRITON_MODEL_REPO} --model-control-mode=explicit" +SERVER_LOG="./inference_server.s3.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + # Clean up bucket contents and delete bucket before exiting test + aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" + aws s3 rb "${BUCKET_URL}" + exit 1 +fi + +# ONNX flavor with Python package +set +e +PY_LOG=plugin_py.s3.log +PY_TEST=plugin_test.py +TEST_RESULT_FILE='test_results.txt' +python $PY_TEST >>$PY_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $SERVER_LOG + cat $PY_LOG + echo -e "\n***\n*** Python Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 2 + if [ $? -ne 0 ]; then + cat $PY_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill_server + +# Clean up bucket contents and delete bucket +aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" +aws s3 rb "${BUCKET_URL}" + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else diff --git a/qa/L0_model_config/autofill_noplatform/common/no_version/expected b/qa/L0_model_config/autofill_noplatform/common/no_version/expected index 483dbc34cb..94e9de9123 100644 --- a/qa/L0_model_config/autofill_noplatform/common/no_version/expected +++ b/qa/L0_model_config/autofill_noplatform/common/no_version/expected @@ -1 +1 @@ -unexpected platform type '' for no_version \ No newline at end of file +Invalid model name: Could not determine backend for model 'no_version' with no backend in model configuration. Expected model name of the form 'model.'. diff --git a/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/config.pbtxt b/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/expected b/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/expected new file mode 100644 index 0000000000..57b8cbdc02 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/custom/no_delimiter/expected @@ -0,0 +1 @@ +Invalid model name: Could not determine backend for model 'no_delimiter' with no backend in model configuration. Expected model name of the form 'model.'. 
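[Editor's aside, not part of the patch] The S3 portion above drives bucket creation and cleanup through the aws CLI; a rough boto3 equivalent of that lifecycle (an assumption for illustration only, the test itself shells out to aws) looks like:

import os

import boto3

bucket_name = f"triton-bucket-{os.environ.get('CI_JOB_ID', 'local')}"
s3 = boto3.resource(
    "s3",
    region_name=os.environ["AWS_DEFAULT_REGION"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# aws s3 mb "${BUCKET_URL}" (regions other than us-east-1 also need a
# CreateBucketConfiguration with a LocationConstraint)
s3.create_bucket(Bucket=bucket_name)

# ... start tritonserver with --model-repository=s3://<bucket> and run plugin_test.py ...

# aws s3 rm "${BUCKET_URL_SLASH}" --recursive && aws s3 rb "${BUCKET_URL}"
s3.Bucket(bucket_name).objects.all().delete()
s3.Bucket(bucket_name).delete()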
diff --git a/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/config.pbtxt b/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/expected b/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/expected new file mode 100644 index 0000000000..e5f6d77f81 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/custom/unknown_backend.unknown/expected @@ -0,0 +1,2 @@ +Invalid argument: unable to find 'libtriton_unknown.so' or 'unknown/model.py' for model 'unknown_backend.unknown' + diff --git a/qa/L0_model_config/autofill_noplatform/ensemble/invalid_input_map/invalid_input_map/config.pbtxt b/qa/L0_model_config/autofill_noplatform/ensemble/invalid_input_map/invalid_input_map/config.pbtxt index 2a38f51a85..8bb0896d40 100644 --- a/qa/L0_model_config/autofill_noplatform/ensemble/invalid_input_map/invalid_input_map/config.pbtxt +++ b/qa/L0_model_config/autofill_noplatform/ensemble/invalid_input_map/invalid_input_map/config.pbtxt @@ -71,7 +71,7 @@ ensemble_scheduling { value: "temp_tensor_3" } input_map { - key: "INTPUT3" + key: "INPUT3" value: "temp_tensor_4" } input_map { diff --git a/qa/L0_model_config/autofill_noplatform/ensemble/non_existing_model/expected b/qa/L0_model_config/autofill_noplatform/ensemble/non_existing_model/expected index 4dd27097c5..09561377d9 100644 --- a/qa/L0_model_config/autofill_noplatform/ensemble/non_existing_model/expected +++ b/qa/L0_model_config/autofill_noplatform/ensemble/non_existing_model/expected @@ -1 +1 @@ -ensemble non_existing_model contains models that are not available: fp32_dim1_batch4_input4 \ No newline at end of file +ensemble non_existing_model contains models that are not available or ambiguous: fp32_dim1_batch4_input4 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/config.pbtxt b/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/config.pbtxt new file mode 100644 index 0000000000..61e5eee972 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/config.pbtxt @@ -0,0 +1,94 @@ +name: "unreachable_output_3" +max_batch_size: 2 +platform: "ensemble" +ensemble_scheduling { + step [ + { + model_name: "fp32_dim1_batch4" + model_version: -1 + input_map { + key: "input" + value: "data" + } + output_map { + key: "output" + value: "temp_tensor_4" + } + }, + { + model_name: "fp32_dim1_batch4" + model_version: -1 + input_map { + key: "input" + value: "not_written_tensor" + } + output_map { + key: "output" + value: "prob_2" + } + }, + { + model_name: "fp32_dim1_batch4_output3" + model_version: -1 + input_map { + key: "input" + value: "data" + } + output_map { + key: "output1" + value: "temp_tensor_1" + } + output_map { + key: "output2" + value: "temp_tensor_2" + } + output_map { + key: "output3" + value: "temp_tensor_3" + } + }, + { + model_name: "fp32_dim1_batch4_input4" + model_version: -1 + input_map { + key: "input1" + value: "temp_tensor_1" + } + input_map { + key: "input2" + value: "temp_tensor_2" + } + input_map { + key: "input3" + value: "temp_tensor_3" + } + input_map { + key: "input4" + value: "temp_tensor_4" + } + output_map { + key: "output" + value: "prob" + } + } + ] +} +input [ + { + name: "data" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +output [ + { + name: "prob" + data_type: TYPE_FP32 + dims: [ 16 ] + }, + { + name: "prob_2" + 
data_type: TYPE_FP32 + dims: [ 16 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/expected b/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/expected new file mode 100644 index 0000000000..f7add40dda --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/ensemble/unreachable_output_3/expected @@ -0,0 +1 @@ +output 'prob_2' for ensemble 'unreachable_output_3' is not written: at least one of its depending tensors, 'not_written_tensor', is not connected \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/config.pbtxt new file mode 100644 index 0000000000..87f49cf11a --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/config.pbtxt @@ -0,0 +1,12 @@ +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 256 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/expected b/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/expected new file mode 100644 index 0000000000..bd6051f9d5 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/bad_input_dims/expected @@ -0,0 +1 @@ +model 'bad_input_dims', tensor 'input1': the model expects 2 dimensions (shape \[1,4\]) but the model configuration specifies 2 dimensions (shape \[1,256\]) \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/config.pbtxt new file mode 100644 index 0000000000..b177c07d18 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/config.pbtxt @@ -0,0 +1,12 @@ +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 128 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/expected b/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/expected new file mode 100644 index 0000000000..2f0e5be8e2 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/bad_output_dims/expected @@ -0,0 +1 @@ +model 'bad_output_dims', tensor 'Func/PartitionedCall/output/_2:0': the model expects 2 dimensions (shape \[1,4\]) but the model configuration specifies 2 dimensions (shape \[1,128\]) \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/config.pbtxt new file mode 100644 index 0000000000..be95f0b18a --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/config.pbtxt @@ -0,0 +1,6 @@ +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/expected b/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/expected new file mode 100644 index 0000000000..f6639e85ae --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/too_few_inputs/expected @@ -0,0 +1 @@ +unable to load model 'too_few_inputs', configuration expects 1 inputs, model provides 2 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/config.pbtxt 
b/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/config.pbtxt new file mode 100644 index 0000000000..283f498b33 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/config.pbtxt @@ -0,0 +1,18 @@ +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input_extra" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/expected b/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/expected new file mode 100644 index 0000000000..e88e97dcfb --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/too_many_inputs/expected @@ -0,0 +1 @@ +unable to load model 'too_many_inputs', configuration expects 3 inputs, model provides 2 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/config.pbtxt new file mode 100644 index 0000000000..ed519869f3 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/config.pbtxt @@ -0,0 +1,24 @@ +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "unknown_input" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/expected b/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/expected new file mode 100644 index 0000000000..e540422197 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/unknown_input/expected @@ -0,0 +1 @@ +unexpected inference input 'unknown_input', allowed inputs are: Func/PartitionedCall/input/_0:0, input1 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/config.pbtxt b/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/config.pbtxt new file mode 100644 index 0000000000..202ec57eca --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/config.pbtxt @@ -0,0 +1,18 @@ +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "unknown_output" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} diff --git a/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/expected b/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/expected new file mode 100644 index 0000000000..b374338374 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/openvino/unknown_output/expected @@ -0,0 +1 @@ +unexpected inference output 'unknown_output', allowed outputs are: Func/PartitionedCall/output/_2:0, Func/PartitionedCall/output/_3:0 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/python/conflicting_max_batch_size/model.py b/qa/L0_model_config/autofill_noplatform/python/conflicting_max_batch_size/model.py index ef24740bd6..17da02915b 100644 --- a/qa/L0_model_config/autofill_noplatform/python/conflicting_max_batch_size/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/conflicting_max_batch_size/model.py @@ -1,4 +1,4 @@ -# 
Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/conflicting_scheduler_sequence/model.py b/qa/L0_model_config/autofill_noplatform/python/conflicting_scheduler_sequence/model.py index d668deb544..b1399382c4 100644 --- a/qa/L0_model_config/autofill_noplatform/python/conflicting_scheduler_sequence/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/conflicting_scheduler_sequence/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform/python/input_missing_datatype/model.py b/qa/L0_model_config/autofill_noplatform/python/input_missing_datatype/model.py index 41a80a334f..cfd6aab9d6 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_missing_datatype/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/input_missing_datatype/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/input_missing_dims/model.py b/qa/L0_model_config/autofill_noplatform/python/input_missing_dims/model.py index 3e45521117..8c02b4ce40 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_missing_dims/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/input_missing_dims/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32'} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32"} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/input_missing_name/model.py b/qa/L0_model_config/autofill_noplatform/python/input_missing_name/model.py index 93bd36ef1f..33a76b6b30 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_missing_name/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/input_missing_name/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/expected b/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/expected index 9b34c74b2b..c91f4599ee 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/expected +++ b/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/expected @@ -1 +1 @@ -input 'INPUT1' in auto-complete-config function for model 'input_wrong_property' contains property other than 'name', 'data_type' and 'dims'. +input 'INPUT1' in auto-complete-config function for model 'input_wrong_property' contains property other than 'name', 'data_type', 'dims' and 'optional'. diff --git a/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/model.py b/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/model.py index e43008e584..f3e883db06 100644 --- a/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/input_wrong_property/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,19 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4], 'is_shape_tensor:' : True} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = { + "name": "INPUT1", + "data_type": "TYPE_FP32", + "dims": [4], + "is_shape_tensor:": True, + } + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/config.pbtxt b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/config.pbtxt new file mode 100644 index 0000000000..3100235010 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/config.pbtxt @@ -0,0 +1,24 @@ +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/expected b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/expected new file mode 100644 index 0000000000..388c6a728d --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/expected @@ -0,0 +1 @@ +model transaction property in auto-complete-config function for model 'model_transaction_policy_invalid_args' contains property other than 'decoupled' diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/model.py b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/model.py new file mode 100644 index 0000000000..4de9d7c80a --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_invalid_args/model.py @@ -0,0 +1,47 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + transaction_policy = {"invalid": "argument"} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(transaction_policy) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/config.pbtxt b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/config.pbtxt new file mode 100644 index 0000000000..f8113f307e --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/config.pbtxt @@ -0,0 +1,28 @@ +model_transaction_policy { + decoupled: false +} + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/expected b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/expected new file mode 100644 index 0000000000..bbdc5d2165 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/expected @@ -0,0 +1 @@ +trying to change decoupled property in auto-complete-config for model 'model_transaction_policy_mismatch', which is already set to 'False' diff --git a/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/model.py b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/model.py new file mode 100644 index 0000000000..424eca60ce --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/python/model_transaction_policy_mismatch/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(dict(decoupled=True)) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform/python/no_return/model.py b/qa/L0_model_config/autofill_noplatform/python/no_return/model.py index f22d144f47..65fae1dcc2 100644 --- a/qa/L0_model_config/autofill_noplatform/python/no_return/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/no_return/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/output_missing_datatype/model.py b/qa/L0_model_config/autofill_noplatform/python/output_missing_datatype/model.py index 431ef1930f..26ef3e5c7e 100644 --- a/qa/L0_model_config/autofill_noplatform/python/output_missing_datatype/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/output_missing_datatype/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/output_missing_dims/model.py b/qa/L0_model_config/autofill_noplatform/python/output_missing_dims/model.py index 6e05fcbb11..6e43928239 100644 --- a/qa/L0_model_config/autofill_noplatform/python/output_missing_dims/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/output_missing_dims/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32'} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32"} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/output_missing_name/model.py b/qa/L0_model_config/autofill_noplatform/python/output_missing_name/model.py index 2d1651431d..cde57b7827 100644 --- a/qa/L0_model_config/autofill_noplatform/python/output_missing_name/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/output_missing_name/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform/python/output_wrong_property/model.py b/qa/L0_model_config/autofill_noplatform/python/output_wrong_property/model.py index ddccf9fb4f..4dd17ea4e3 100644 --- a/qa/L0_model_config/autofill_noplatform/python/output_wrong_property/model.py +++ b/qa/L0_model_config/autofill_noplatform/python/output_wrong_property/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,19 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4], 'is_shape_tensor:' : True} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = { + "name": "OUTPUT1", + "data_type": "TYPE_FP32", + "dims": [4], + "is_shape_tensor:": True, + } auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/expected new file mode 100644 index 0000000000..9db37f7864 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_dims/expected @@ -0,0 +1 @@ +Internal: unable to autofill for 'bad_input_dims', model tensor configurations are contradicting each other in terms of whether batching is supported \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/expected new file mode 100644 index 
0000000000..584634b2eb --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_input_type/expected @@ -0,0 +1 @@ +Invalid argument: unable to load model 'bad_input_type', configuration expects datatype TYPE_FP32 for input 'INPUT1', model provides TYPE_INT32 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/expected new file mode 100644 index 0000000000..70a0138e77 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_dims/expected @@ -0,0 +1 @@ +Invalid argument: model 'bad_output_dims', tensor 'OUTPUT1': the model expects 2 dimensions (shape \[-1,16\]) but the model configuration specifies 2 dimensions (an initial batch dimension because max_batch_size > 0 followed by the explicit tensor shape, making complete shape \[-1,1\]) \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/expected new file mode 100644 index 0000000000..bbbe1846d1 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/bad_output_type/expected @@ -0,0 +1 @@ +Invalid argument: unable to load model 'bad_output_type', configuration expects datatype TYPE_INT16 for output 'OUTPUT0', model provides TYPE_INT8 \ No newline at end of file diff --git 
a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/config.pbtxt similarity index 93% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/config.pbtxt index 6ba2274876..cee3e28b89 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/config.pbtxt +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/config.pbtxt @@ -11,7 +11,7 @@ input [ dims: [ 16 ] }, { - name: "INPUT_EXTRA" + name: "INPUT1" data_type: TYPE_INT32 dims: [ 16 ] } diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/expected new file mode 100644 index 0000000000..caaebb93a0 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/too_many_inputs/expected @@ -0,0 +1 @@ +Invalid argument: unable to load model 'too_many_inputs', configuration expects 3 inputs, model provides 2 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/too_many_inputs/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/expected new file mode 100644 index 0000000000..3f101c14fa --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_input/expected @@ -0,0 +1 @@ +Invalid argument: unexpected inference input 'INPUT_UNKNOWN', allowed inputs are: INPUT0, INPUT1 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/1/model.savedmodel/saved_model.pb similarity index 100% rename from 
qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/config.pbtxt b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/config.pbtxt rename to qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/expected b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/expected new file mode 100644 index 0000000000..a525ae910b --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform/tensorflow_savedmodel/unknown_output/expected @@ -0,0 +1 @@ +Invalid argument: unexpected inference output 'OUTPUT_UNKNOWN', allowed outputs are: OUTPUT0, OUTPUT1 \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_max/expected b/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_max/expected index 24bbb8f7d2..33630c195b 100644 --- a/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_max/expected +++ b/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_max/expected @@ -1 +1 @@ -model configuration specified invalid shape for input 'INPUT0' for model bad_dynamic_shapes_max. Error details: model expected the shape of dimension 0 to be between 4 and 32 but received 33 +model configuration specified invalid shape for input 'INPUT0' for model bad_dynamic_shapes_max. Error details: model expected the shape of dimension 1 to be between 4 and 32 but received 33 diff --git a/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_min/expected b/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_min/expected index add01d771b..288d129df0 100644 --- a/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_min/expected +++ b/qa/L0_model_config/autofill_noplatform/tensorrt/bad_dynamic_shapes_min/expected @@ -1 +1 @@ -model configuration specified invalid shape for input 'INPUT0' for model bad_dynamic_shapes_min. Error details: model expected the shape of dimension 0 to be between 4 and 32 but received 3 +model configuration specified invalid shape for input 'INPUT0' for model bad_dynamic_shapes_min. 
Error details: model expected the shape of dimension 1 to be between 4 and 32 but received 3 diff --git a/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/expected b/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/expected new file mode 100644 index 0000000000..be092e0b0d --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/custom/empty_config.identity/expected @@ -0,0 +1,22 @@ +name: "empty_config.identity" +version_policy { +latest { + num_versions: 1 +} +} +instance_group { +name: "empty_config.identity" +count: 1 +gpus: 0 +kind: KIND_GPU +} +default_model_filename: "model.identity" +optimization { +input_pinned_memory { + enable: true +} +output_pinned_memory { + enable: true +} +} +backend: "identity" diff --git a/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/config.pbtxt new file mode 100644 index 0000000000..575da253a5 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/config.pbtxt @@ -0,0 +1,15 @@ +max_batch_size: 64 +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 1000 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 1000 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/expected b/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/expected new file mode 100644 index 0000000000..e5edfe5f9e --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/custom/no_backend.identity/expected @@ -0,0 +1,33 @@ +name: "no_backend.identity" +version_policy { +latest { + num_versions: 1 +} +} +max_batch_size: 64 +input { +name: "INPUT0" +data_type: TYPE_INT32 +dims: 1000 +} +output { +name: "OUTPUT0" +data_type: TYPE_INT32 +dims: 1000 +} +instance_group { +name: "no_backend.identity" +count: 1 +gpus: 0 +kind: KIND_GPU +} +default_model_filename: "model.identity" +optimization { +input_pinned_memory { + enable: true +} +output_pinned_memory { + enable: true +} +} +backend: "identity" diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/cpu_instance/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/onnx/cpu_instance/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected index c8af844b2d..fd06613612 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.1 index 436e7937a2..65da68ab57 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.1 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + 
preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.2 index c2a4e3d863..32365f3fd4 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.2 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.3 index 9f00645e90..0307a34cae 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/empty_config/expected.3 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected index d8e3a1222f..5a03128998 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.1 index 74174340b5..ca1e128d12 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.1 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.2 index fc75b0e0a2..fece0349ea 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.2 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.3 index fb1f739756..107b9cfc3d 100644 --- a/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/onnx/no_config/expected.3 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.onnx" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected new file mode 100644 
index 0000000000..f4d0fb85bb --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected @@ -0,0 +1,45 @@ +name: "dynamic_batch" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 4 +} +instance_group { + name: "dynamic_batch" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +dynamic_batching { + preferred_batch_size: 4 +} +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.1 b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.1 new file mode 100644 index 0000000000..4e420de350 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.1 @@ -0,0 +1,45 @@ +name: "dynamic_batch" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "input1" + data_type: TYPE_INT32 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 4 +} +instance_group { + name: "dynamic_batch" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +dynamic_batching { + preferred_batch_size: 4 +} +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.2 b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.2 new file mode 100644 index 0000000000..f66217757d --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.2 @@ -0,0 +1,45 @@ +name: "dynamic_batch" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 4 +} +instance_group { + name: "dynamic_batch" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +dynamic_batching { + preferred_batch_size: 4 +} +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.3 b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.3 new file mode 100644 index 0000000000..5a08b4c736 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/dynamic_batch/expected.3 @@ -0,0 +1,45 @@ +name: "dynamic_batch" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "input1" + data_type: TYPE_INT32 + 
dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 4 +} +instance_group { + name: "dynamic_batch" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +dynamic_batching { + preferred_batch_size: 4 +} +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/config.pbtxt new file mode 100644 index 0000000000..e69de29bb2 diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected new file mode 100644 index 0000000000..4ff077e8ab --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected @@ -0,0 +1,45 @@ +name: "empty_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "empty_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.1 new file mode 100644 index 0000000000..8c7ca01525 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.1 @@ -0,0 +1,45 @@ +name: "empty_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "empty_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.2 new file mode 100644 index 0000000000..bd0cc02f27 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.2 @@ -0,0 +1,45 @@ +name: "empty_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: 
"Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "empty_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.3 new file mode 100644 index 0000000000..745125a795 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/empty_config/expected.3 @@ -0,0 +1,45 @@ +name: "empty_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "empty_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected new file mode 100644 index 0000000000..8506cd53fb --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected @@ -0,0 +1,45 @@ +name: "no_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "no_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.1 new file mode 100644 index 0000000000..f2637ede14 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.1 @@ -0,0 +1,45 @@ +name: "no_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "no_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline 
at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.2 new file mode 100644 index 0000000000..3c625cada5 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.2 @@ -0,0 +1,45 @@ +name: "no_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "no_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.3 new file mode 100644 index 0000000000..4076982ca5 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/no_config/expected.3 @@ -0,0 +1,45 @@ +name: "no_config" +version_policy { + latest { + num_versions: 1 + } +} +input { + name: "input1" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +input { + name: "Func/PartitionedCall/input/_0:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_3:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +output { + name: "Func/PartitionedCall/output/_2:0" + data_type: TYPE_INT32 + dims: 1 + dims: 4 +} +instance_group { + name: "no_config" + count: 1 + kind: KIND_CPU +} +default_model_filename: "model.xml" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/config.pbtxt new file mode 100644 index 0000000000..cfdc579dae --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/config.pbtxt @@ -0,0 +1,14 @@ +max_batch_size: 8 +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT8 + dims: [ 16 ] + label_filename: "output0_labels.txt" + }, + { + name: "OUTPUT1" + data_type: TYPE_INT8 + dims: [ 16 ] + } +] \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected similarity index 62% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected rename to qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected index e08c2471c5..b95f710bd9 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected @@ -1,38 +1,37 @@ -name: "unknown_input" -platform: "tensorflow_savedmodel" +name: "partial_config" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 8 input { name: "INPUT1" - data_type: TYPE_INT32 + data_type: TYPE_INT8 dims: 
16 } input { name: "INPUT0" - data_type: TYPE_INT32 + data_type: TYPE_INT8 dims: 16 } output { - name: "OUTPUT1" + name: "OUTPUT0" data_type: TYPE_INT8 dims: 16 + label_filename: "output0_labels.txt" } output { - name: "OUTPUT0" + name: "OUTPUT1" data_type: TYPE_INT8 dims: 16 } instance_group { - name: "unknown_input" + name: "partial_config" count: 1 - gpus: 0 - kind: KIND_GPU + kind: KIND_CPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.xml" optimization { input_pinned_memory { enable: true @@ -41,4 +40,4 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.1 b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected.1 similarity index 62% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.1 rename to qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected.1 index c97f486287..688ac8fbf5 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/openvino/partial_config/expected.1 @@ -1,38 +1,37 @@ -name: "unknown_input" -platform: "tensorflow_savedmodel" +name: "partial_config" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 8 input { name: "INPUT0" - data_type: TYPE_INT32 + data_type: TYPE_INT8 dims: 16 } input { name: "INPUT1" - data_type: TYPE_INT32 + data_type: TYPE_INT8 dims: 16 } output { - name: "OUTPUT1" + name: "OUTPUT0" data_type: TYPE_INT8 dims: 16 + label_filename: "output0_labels.txt" } output { - name: "OUTPUT0" + name: "OUTPUT1" data_type: TYPE_INT8 dims: 16 } instance_group { - name: "unknown_input" + name: "partial_config" count: 1 - gpus: 0 - kind: KIND_GPU + kind: KIND_CPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.xml" optimization { input_pinned_memory { enable: true @@ -41,4 +40,4 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "openvino" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/conflicting_scheduler_ensemble/model.py b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/conflicting_scheduler_ensemble/model.py index 72f588f7cb..57589bacdf 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/conflicting_scheduler_ensemble/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/conflicting_scheduler_ensemble/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,17 +24,12 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_first_step/model.py b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_first_step/model.py index 72f588f7cb..57589bacdf 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_first_step/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_first_step/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,17 +24,12 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_second_step/model.py b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_second_step/model.py index 72f588f7cb..57589bacdf 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_second_step/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/conflicting_scheduler_ensemble/ensemble_second_step/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,17 +24,12 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected index 577ce5cce4..f11fa57bf2 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected @@ -33,6 +33,7 @@ instance_group { } default_model_filename: "model.py" dynamic_batching { + preferred_batch_size: 4 } optimization { input_pinned_memory { diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.1 b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.1 index 4880649296..1e5a266319 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.1 @@ -33,6 +33,7 @@ instance_group { } default_model_filename: "model.py" dynamic_batching { + preferred_batch_size: 4 } optimization { input_pinned_memory { diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.2 index 30bdfa2c0f..4b96c9b2a6 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.2 @@ -33,6 +33,7 @@ instance_group { } default_model_filename: "model.py" dynamic_batching { + preferred_batch_size: 4 } optimization { input_pinned_memory { diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.3 b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.3 index 214f8ef16d..f3c6508cab 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/expected.3 @@ -33,6 +33,7 @@ instance_group { } default_model_filename: "model.py" dynamic_batching { + preferred_batch_size: 4 } optimization { input_pinned_memory { diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/model.py b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/model.py index d668deb544..b1399382c4 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching_no_op/model.py b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching_no_op/model.py index d668deb544..b1399382c4 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching_no_op/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/dynamic_batching_no_op/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,14 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(4) auto_complete_model_config.set_dynamic_batching() diff --git a/qa/L0_model_config/autofill_noplatform_success/python/incomplete_input/model.py b/qa/L0_model_config/autofill_noplatform_success/python/incomplete_input/model.py index 48a08b10ad..75000a0ba4 100644 --- a/qa/L0_model_config/autofill_noplatform_success/python/incomplete_input/model.py +++ b/qa/L0_model_config/autofill_noplatform_success/python/incomplete_input/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,18 +24,13 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/config.pbtxt new file mode 100644 index 0000000000..3100235010 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/config.pbtxt @@ -0,0 +1,24 @@ +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected similarity index 51% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected index a751f6a56a..4384a240a0 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected @@ -1,38 +1,37 @@ -name: "bad_input_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_dims" + name: "model_transaction_policy" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.1 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.1 similarity index 51% 
rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.1 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.1 index 76e9ff1b96..0ec85aa3f2 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.1 @@ -1,38 +1,37 @@ -name: "bad_input_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_type" + name: "model_transaction_policy" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.2 similarity index 51% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.2 index 9386bf4541..db2d305cc2 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.2 @@ -1,38 +1,37 @@ -name: "bad_input_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_type" + name: "model_transaction_policy" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.3 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.3 similarity index 51% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.3 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.3 index 5361bbe5b2..2d88c5a970 100644 --- 
a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/expected.3 @@ -1,38 +1,37 @@ -name: "bad_input_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_type" + name: "model_transaction_policy" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/model.py b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/model.py new file mode 100644 index 0000000000..424eca60ce --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(dict(decoupled=True)) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/config.pbtxt new file mode 100644 index 0000000000..3100235010 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/config.pbtxt @@ -0,0 +1,24 @@ +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected new file mode 100644 index 0000000000..173c66ce07 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected @@ -0,0 +1,45 @@ +name: "model_transaction_policy_decoupled_false" +version_policy { + latest { + num_versions: 1 + } +} +max_batch_size: 4 +input { + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 +} +input { + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 +} +output { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 +} +output { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 +} +instance_group { + name: "model_transaction_policy_decoupled_false" + count: 1 + gpus: 0 + kind: KIND_GPU +} +default_model_filename: "model.py" +optimization { + input_pinned_memory { + enable: true + } + output_pinned_memory { + enable: true + } +} +backend: "python" +model_transaction_policy { +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.1 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.1 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.1 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.1 index 6fee8a3160..bc03df083b 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.1 @@ -1,38 +1,37 @@ -name: "bad_output_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_decoupled_false" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - 
dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_output_dims" + name: "model_transaction_policy_decoupled_false" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,6 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { +} \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.2 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.2 index 01d91d8868..89ddbebf8b 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.2 @@ -1,38 +1,37 @@ -name: "bad_input_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_decoupled_false" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_dims" + name: "model_transaction_policy_decoupled_false" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,6 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { +} \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.3 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.3 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.3 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.3 index 7fa56796b1..75aefdca7f 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/expected.3 @@ -1,38 +1,37 @@ -name: "bad_output_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_decoupled_false" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: 
TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_output_type" + name: "model_transaction_policy_decoupled_false" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,6 @@ optimization { enable: true } } -backend: "tensorflow" \ No newline at end of file +backend: "python" +model_transaction_policy { +} \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/model.py b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/model.py new file mode 100644 index 0000000000..848af2a2b2 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_decoupled_false/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(dict(decoupled=False)) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/config.pbtxt new file mode 100644 index 0000000000..1bbf76caaf --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/config.pbtxt @@ -0,0 +1,28 @@ +model_transaction_policy { + decoupled: true +} + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected index 9cda30fccb..4c171e5acc 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_type/expected +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected @@ -1,38 +1,37 @@ -name: "bad_input_type" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_no_op" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_type" + name: "model_transaction_policy_no_op" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.1 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.1 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.1 rename to 
qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.1 index 896a8c2c1e..cf3a56f3a9 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.1 @@ -1,38 +1,37 @@ -name: "bad_input_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_no_op" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_input_dims" + name: "model_transaction_policy_no_op" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.2 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.2 index a53a195e36..2a7e018955 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.2 @@ -1,38 +1,37 @@ -name: "bad_output_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_no_op" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_output_dims" + name: "model_transaction_policy_no_op" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.3 b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.3 similarity index 50% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.3 rename to qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.3 index 
215306f8cd..4fbaae787b 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/expected.3 @@ -1,38 +1,37 @@ -name: "bad_output_dims" -platform: "tensorflow_savedmodel" +name: "model_transaction_policy_no_op" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 +max_batch_size: 4 input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT1" + data_type: TYPE_FP32 + dims: 4 } input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + name: "INPUT0" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: 4 } output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "bad_output_dims" + name: "model_transaction_policy_no_op" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,7 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" +model_transaction_policy { + decoupled: true +} diff --git a/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/model.py b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/model.py new file mode 100644 index 0000000000..424eca60ce --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/model_transaction_policy_no_op/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(4) + auto_complete_model_config.set_model_transaction_policy(dict(decoupled=True)) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_input(input1) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform_success/python/optional_input/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/config.pbtxt new file mode 100644 index 0000000000..2d2868b90e --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/config.pbtxt @@ -0,0 +1,7 @@ +input [ + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.2 b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/expected similarity index 52% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/python/optional_input/expected index 9dcea4093c..8bbab5a3b0 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_input/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/expected @@ -1,38 +1,37 @@ -name: "unknown_input" -platform: "tensorflow_savedmodel" +name: "optional_input" version_policy { latest { num_versions: 1 } } -max_batch_size: 1 input { name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } input { name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 + data_type: TYPE_FP32 + dims: 4 + optional: true } output { name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } output { name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 + data_type: TYPE_FP32 + dims: 4 } instance_group { - name: "unknown_input" + name: "optional_input" count: 1 gpus: 0 kind: KIND_GPU } -default_model_filename: "model.savedmodel" +default_model_filename: "model.py" optimization { input_pinned_memory { enable: true @@ -41,4 +40,4 @@ optimization { enable: true } } -backend: "tensorflow" +backend: "python" diff --git a/qa/L0_model_config/autofill_noplatform_success/python/optional_input/model.py b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/model.py new file mode 100644 index 0000000000..fca8e06818 --- /dev/null +++ b/qa/L0_model_config/autofill_noplatform_success/python/optional_input/model.py @@ -0,0 +1,48 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input0 = { + "name": "INPUT0", + "data_type": "TYPE_FP32", + "dims": [4], + "optional": True, + } + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} + + auto_complete_model_config.set_max_batch_size(0) + auto_complete_model_config.add_input(input0) + auto_complete_model_config.add_output(output0) + auto_complete_model_config.add_output(output1) + + return auto_complete_model_config + + def execute(self, requests): + pass diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.3 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.3 deleted file mode 100644 index e8b91f678e..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_input_dims/expected.3 +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_input_dims" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_input_dims" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected deleted file mode 100644 index 948d3a5e32..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_dims/expected +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_output_dims" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 
16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_output_dims" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected deleted file mode 100644 index 584768c4dc..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_output_type" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_output_type" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.1 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.1 deleted file mode 100644 index eb8a279bac..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.1 +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_output_type" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_output_type" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" \ No newline at end of file diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.2 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.2 deleted file mode 100644 index d36280de72..0000000000 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/bad_output_type/expected.2 +++ /dev/null @@ -1,44 +0,0 @@ -name: "bad_output_type" -platform: "tensorflow_savedmodel" -version_policy { - latest { - num_versions: 1 - } -} -max_batch_size: 1 -input { - name: "INPUT1" - data_type: TYPE_INT32 - dims: 16 -} -input { - name: "INPUT0" - data_type: TYPE_INT32 - dims: 16 -} -output { - name: "OUTPUT1" - data_type: TYPE_INT8 - dims: 16 -} -output { - name: "OUTPUT0" - data_type: TYPE_INT8 - dims: 16 -} -instance_group { - name: "bad_output_type" - count: 1 - gpus: 0 - kind: KIND_GPU -} -default_model_filename: "model.savedmodel" -optimization { - input_pinned_memory { - enable: true - } - output_pinned_memory { - enable: true - } -} -backend: "tensorflow" \ No newline at end of file diff --git 
a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected index 8f795e196c..9773774b21 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.savedmodel" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.1 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.1 index a57171a3eb..adae59e945 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.1 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.savedmodel" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.2 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.2 index cececc0cdc..ea92806ad7 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.2 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.savedmodel" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.3 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.3 index b2987d0d14..983c1ed344 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/empty_config/expected.3 @@ -33,6 +33,7 @@ instance_group { kind: KIND_GPU } dynamic_batching { + preferred_batch_size: 4 } default_model_filename: "model.savedmodel" optimization { diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/1/model.savedmodel/saved_model.pb similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/unknown_output/1/model.savedmodel/saved_model.pb rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/1/model.savedmodel/saved_model.pb diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/config.pbtxt b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/config.pbtxt similarity index 100% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/config.pbtxt rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/config.pbtxt diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected 
b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected similarity index 91% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected index b84d053a84..f7ea4005b2 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected @@ -1,4 +1,4 @@ -name: "hint_for_no_batch" +name: "hint_for_no_batch_1" platform: "tensorflow_savedmodel" version_policy { latest { @@ -30,7 +30,7 @@ output { dims: 16 } instance_group { - name: "hint_for_no_batch" + name: "hint_for_no_batch_1" count: 1 gpus: 0 kind: KIND_GPU diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.1 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.1 similarity index 91% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.1 rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.1 index 5865093359..30455a0b7f 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.1 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.1 @@ -1,4 +1,4 @@ -name: "hint_for_no_batch" +name: "hint_for_no_batch_1" platform: "tensorflow_savedmodel" version_policy { latest { @@ -30,7 +30,7 @@ output { dims: 16 } instance_group { - name: "hint_for_no_batch" + name: "hint_for_no_batch_1" count: 1 gpus: 0 kind: KIND_GPU diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.2 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.2 similarity index 91% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.2 rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.2 index e5bfc5fed9..bf05e9f287 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.2 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.2 @@ -1,4 +1,4 @@ -name: "hint_for_no_batch" +name: "hint_for_no_batch_1" platform: "tensorflow_savedmodel" version_policy { latest { @@ -30,7 +30,7 @@ output { dims: 16 } instance_group { - name: "hint_for_no_batch" + name: "hint_for_no_batch_1" count: 1 gpus: 0 kind: KIND_GPU diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.3 b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.3 similarity index 91% rename from qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.3 rename to qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.3 index a98f07631f..4bd3165b18 100644 --- a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch/expected.3 +++ b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_1/expected.3 @@ -1,4 +1,4 @@ -name: "hint_for_no_batch" 
+name: "hint_for_no_batch_1" platform: "tensorflow_savedmodel" version_policy { latest { @@ -30,7 +30,7 @@ output { dims: 16 } instance_group { - name: "hint_for_no_batch" + name: "hint_for_no_batch_1" count: 1 gpus: 0 kind: KIND_GPU diff --git a/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/1/model.savedmodel/saved_model.pb b/qa/L0_model_config/autofill_noplatform_success/tensorflow_savedmodel/hint_for_no_batch_2/1/model.savedmodel/saved_model.pb new file mode 100644 index 0000000000000000000000000000000000000000..a76abafbf76990f47d12e0697e119113c0940adb GIT binary patch literal 1407 zcmb_b&2G~`5Uw5DaVKq(PC-IjPCgK1iPX-~w~}z^C9SY?;SweGmPS~1WUndVL3kUU zg)7g1kYLw;)FGf0;mhpo`kU{YKV8Ca0AFPMEQ15Biy%M^qz{JV3A^EzaQl&4;|%zv z!ZvH_^r1UC>YhrnqN%Mz9osMkWxPlk9tyDHCca1babqZxlzGMx!JmMP(VsA(5XSz!3DyfJSV^HVB}uqIJfEm=0)h#tO&a3}qA;L+3hN`1Cdo1DcRt z{hJyH#l|rdhniGPZx?Hdg&~R~KaoTM+-&*40-WR(Fw~SL@2Ls)&>jt~7l}U_!E%sA z@1poF8sH}l)^O~-nz~o7=aC8+GLK!a+l+RT@vqo_(*tk#2u&s^=7==2ZMwGE7QK5&t|GQOdx@e&&0tp3wf8Sy zB?5d<#}_W|Nj}9yBw<21v_fR}-jMR~6mR(m%a*n`TSa15Bs`n{PvSzioU*H#mycP! zNTkSdZ^2cMG}sPm<5tkRpZk{s4-8p9GrvpF6RWd|-p&Jhv&ce*UnQ_WE4Sns^crj9 zSp63HeHChijavx&4+tDVyQ> $CLIENT_LOG + + run_server + if [ "$SERVER_PID" != "0" ]; then + echo -e "*** FAILED: unexpected success starting $SERVER" >> $CLIENT_LOG + RET=1 + kill $SERVER_PID + wait $SERVER_PID + else + EXFOUND=0 + for EXPECTED in `ls $EXPECTEDS`; do + EX=`cat $EXPECTED` + echo "grepping for: $EX" + if grep "$EX" $SERVER_LOG; then + echo -e "Found \"$EX\"" >> $CLIENT_LOG + EXFOUND=1 + break + else + echo -e "Not found \"$EX\"" >> $CLIENT_LOG + fi + done + if [ "$EXFOUND" == "0" ]; then + echo -e "*** FAILED: cli_messages/$TARGET" >> $CLIENT_LOG + RET=1 + fi + fi +done + # Run special test cases for TARGET in `ls special_cases`; do - SERVER_ARGS="--model-repository=`pwd`/models --strict-model-config=true" + case $TARGET in + "invalid_platform") + EXTRA_ARGS="--disable-auto-complete-config" ;; + *) + EXTRA_ARGS="" ;; + esac + + SERVER_ARGS="--model-repository=`pwd`/models $EXTRA_ARGS" SERVER_LOG=$SERVER_LOG_BASE.special_case_${TARGET}.log rm -fr models && mkdir models @@ -288,6 +391,34 @@ for TARGET in `ls special_cases`; do fi done +# Run noautofill unittest +SERVER_ARGS="--model-repository=`pwd`/models --model-control-mode=explicit --log-verbose=1" +SERVER_LOG=$SERVER_LOG_BASE.special_case_noautofill_test.log + +rm -fr models && mkdir models +cp -r special_cases/noautofill_noconfig models/. + +echo -e "Test on special_cases/noautofill_test" >> $CLIENT_LOG + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python noautofill_test.py >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Python NoAutoFill Test Failed\n***" + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + for TRIAL in $TRIALS; do # Run all tests that require no autofill but that add the platform to # the model config before running the test @@ -339,6 +470,57 @@ for TRIAL in $TRIALS; do done done +for TRIAL in $TRIALS; do + # Run all tests that require no autofill but that add the platform to + # the model config before running the test + for TARGET in `ls noautofill_platform`; do + SERVER_ARGS="--model-repository=`pwd`/models --disable-auto-complete-config" + SERVER_LOG=$SERVER_LOG_BASE.noautofill_platform_disableflag_${TRIAL}_${TARGET}.log + + rm -fr models && mkdir models + cp -r noautofill_platform/$TARGET models/. 
+ + CONFIG=models/$TARGET/config.pbtxt + EXPECTEDS=models/$TARGET/expected* + + # If there is a config.pbtxt change/add platform to it + if [ -f $CONFIG ]; then + sed -i '/platform:/d' $CONFIG + echo "platform: \"$TRIAL\"" >> $CONFIG + cat $CONFIG + fi + + echo -e "Test platform $TRIAL on noautofill_platform/$TARGET with disable-auto-complete-config flag" >> $CLIENT_LOG + + # We expect all the tests to fail with one of the expected + # error messages + run_server + if [ "$SERVER_PID" != "0" ]; then + echo -e "*** FAILED: unexpected success starting $SERVER" >> $CLIENT_LOG + RET=1 + kill $SERVER_PID + wait $SERVER_PID + else + EXFOUND=0 + for EXPECTED in `ls $EXPECTEDS`; do + EX=`cat $EXPECTED` + if grep ^E[0-9][0-9][0-9][0-9].*"$EX" $SERVER_LOG; then + echo -e "Found \"$EX\"" >> $CLIENT_LOG + EXFOUND=1 + break + else + echo -e "Not found \"$EX\"" >> $CLIENT_LOG + fi + done + + if [ "$EXFOUND" == "0" ]; then + echo -e "*** FAILED: platform $TRIAL noautofill_platform/$TARGET with disable-auto-complete-config flag" >> $CLIENT_LOG + RET=1 + fi + fi + done +done + # Run all autofill tests that don't add a platform to the model config # before running the test for TARGET_DIR in `ls -d autofill_noplatform/*/*`; do diff --git a/qa/L0_model_namespacing/python_addsub/__init__.py b/qa/L0_model_namespacing/python_addsub/__init__.py new file mode 100755 index 0000000000..a664eafef0 --- /dev/null +++ b/qa/L0_model_namespacing/python_addsub/__init__.py @@ -0,0 +1,123 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
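The new qa/L0_model_namespacing suite that follows ships a pair of shared Python models, python_addsub and python_subadd, which compute OUTPUT0/OUTPUT1 as element-wise add/sub (and sub/add) of INPUT0/INPUT1 and rely on auto_complete_config() so the composing models need no config.pbtxt of their own. Each composing model in the per-test repositories reuses these definitions through a small model.py shim; a minimal sketch of that shim (mirroring the six-line model.py files added later in this patch, and assuming test.sh has exported TRITON_QA_PYTHON_MODEL_DIR) looks like:

    import os
    import sys

    # test.sh exports TRITON_QA_PYTHON_MODEL_DIR=$TRITON_QA_ROOT_DIR/L0_model_namespacing
    sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"])

    # Re-export the shared TritonPythonModel (use python_subadd for the sub/add variant)
    from python_addsub import *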
+ +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + # Use auto complete feature to ship config.pbtxt along with the Python + # model definition + @staticmethod + def auto_complete_config(auto_complete_model_config): + # Only use packaged config if config is not explicitly provided + config = auto_complete_model_config.as_dict() + if (len(config["input"]) != 0) or (len(config["output"]) != 0): + return auto_complete_model_config + + auto_complete_model_config.add_input( + { + "name": "INPUT0", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_input( + { + "name": "INPUT1", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_output( + { + "name": "OUTPUT0", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_output( + { + "name": "OUTPUT1", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + return auto_complete_model_config + + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") + + self.output0_dtype = pb_utils.triton_string_to_numpy( + output0_config["data_type"] + ) + self.output1_dtype = pb_utils.triton_string_to_numpy( + output1_config["data_type"] + ) + + def execute(self, requests): + """This function is called on inference request.""" + + responses = [] + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + responses.append(pb_utils.InferenceResponse(self.addsub(in_0, in_1))) + return responses + + def addsub(self, in_0, in_1): + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + ) + else: + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) + + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(self.output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(self.output1_dtype)) + return [out_tensor_0, out_tensor_1] diff --git a/qa/L0_model_namespacing/python_subadd/__init__.py b/qa/L0_model_namespacing/python_subadd/__init__.py new file mode 100755 index 0000000000..bd3ddefe9e --- /dev/null +++ b/qa/L0_model_namespacing/python_subadd/__init__.py @@ -0,0 +1,123 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + # Use auto complete feature to ship config.pbtxt along with the Python + # model definition + @staticmethod + def auto_complete_config(auto_complete_model_config): + # Only use packaged config if config is not explicitly provided + config = auto_complete_model_config.as_dict() + if (len(config["input"]) != 0) or (len(config["output"]) != 0): + return auto_complete_model_config + + auto_complete_model_config.add_input( + { + "name": "INPUT0", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_input( + { + "name": "INPUT1", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_output( + { + "name": "OUTPUT0", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + auto_complete_model_config.add_output( + { + "name": "OUTPUT1", + "data_type": "TYPE_INT32", + "dims": [ + 16, + ], + } + ) + return auto_complete_model_config + + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") + + self.output0_dtype = pb_utils.triton_string_to_numpy( + output0_config["data_type"] + ) + self.output1_dtype = pb_utils.triton_string_to_numpy( + output1_config["data_type"] + ) + + def execute(self, requests): + """This function is called on inference request.""" + + responses = [] + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + responses.append(pb_utils.InferenceResponse(self.subadd(in_0, in_1))) + return responses + + def subadd(self, in_0, in_1): + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + ) + else: + out_0, out_1 = ( + in_0.as_numpy() - in_1.as_numpy(), + in_0.as_numpy() + in_1.as_numpy(), + ) + + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(self.output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(self.output1_dtype)) + return [out_tensor_0, out_tensor_1] diff --git a/qa/L0_model_namespacing/test.py b/qa/L0_model_namespacing/test.py new file mode 100755 index 0000000000..f45300d4fd --- /dev/null +++ b/qa/L0_model_namespacing/test.py @@ -0,0 +1,361 @@ +#!/usr/bin/env python +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import os +import sys + +sys.path.append(os.path.join(os.environ["TRITON_QA_ROOT_DIR"], "common")) + +import shutil +import time +import unittest + +import numpy as np +import test_util as tu +import tritonclient.http as httpclient +from tritonclient.utils import InferenceServerException + +# +# Test utilities +# + + +# Checker to perform inference on given model, expecting model to have +# [INPUT0, INPUT1] and produce [OUTPUT0, OUTPUT1] where: +# OUTPUT0 = INPUT0 + INPUT1 +# OUTPUT1 = INPUT0 - INPUT1 +class AddSubChecker: + # Optional 'checker_client' may be provided to use a different + # Triton client library, currently it must be either Triton HTTP client + # library or Triton GRPC client library + def __init__(self, checker_client=None): + # client library selection + if checker_client is None: + import tritonclient.http as checker_client + if "http" in checker_client.__name__: + self.client_ = checker_client.InferenceServerClient("localhost:8000") + else: + self.client_ = checker_client.InferenceServerClient("localhost:8001") + + # Create infer input tensors + self.inputs_ = [] + self.inputs_.append(checker_client.InferInput("INPUT0", [16], "INT32")) + self.inputs_.append(checker_client.InferInput("INPUT1", [16], "INT32")) + + # Initialize the data and expected output + input_data = np.arange(start=0, stop=16, dtype=np.int32) + self.inputs_[0].set_data_from_numpy(input_data) + self.inputs_[1].set_data_from_numpy(input_data) + self.expected_outputs_ = { + "add": (input_data + input_data), + "sub": (input_data - input_data), + } + + def infer(self, model): + res = self.client_.infer(model, self.inputs_) + np.testing.assert_allclose( + res.as_numpy("OUTPUT0"), self.expected_outputs_["add"] + ) + np.testing.assert_allclose( + res.as_numpy("OUTPUT1"), self.expected_outputs_["sub"] + ) + + +# Checker to perform inference on given model, expecting model to have +# [INPUT0, INPUT1] and produce [OUTPUT0, OUTPUT1] where: +# OUTPUT0 = INPUT0 - INPUT1 +# OUTPUT1 = INPUT0 + INPUT1 +class SubAddChecker(AddSubChecker): + def infer(self, model): + 
res = self.client_.infer(model, self.inputs_) + np.testing.assert_allclose( + res.as_numpy("OUTPUT0"), self.expected_outputs_["sub"] + ) + np.testing.assert_allclose( + res.as_numpy("OUTPUT1"), self.expected_outputs_["add"] + ) + + +# +# Test suites and cases +# + + +class ModelNamespacePoll(tu.TestResultCollector): + def setUp(self): + self.addsub_ = AddSubChecker() + self.subadd_ = SubAddChecker() + # For other server interaction + self.client_ = httpclient.InferenceServerClient("localhost:8000") + + def check_health(self, expect_live=True, expect_ready=True): + self.assertEqual(self.client_.is_server_live(), expect_live) + self.assertEqual(self.client_.is_server_ready(), expect_ready) + + def test_no_duplication(self): + # Enable model namspacing on repositories that is already valid without + # enabling model namespacing. + # All models should be visible and can be inferred individually + self.check_health() + + # infer check + for model in ["simple_addsub", "composing_addsub"]: + self.addsub_.infer(model) + for model in ["simple_subadd", "composing_subadd"]: + self.subadd_.infer(model) + + def test_duplication(self): + # Enable model namspacing on repositories that each repo has one + # ensemble and it requires an composing model ('composing_model') that + # exists in both repos. + # Expect all models are visible, the ensemble will pick up the correct + # model even the composing model can't be inferred individually. + self.check_health() + + # infer check + for model in [ + "simple_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "simple_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("composing_model") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + def test_ensemble_duplication(self): + # Enable model namspacing on repositories that each repo has one + # ensemble with the same name. Expect the ensemble will pick up the correct + # model. + # Expect all models are visible, the ensemble will pick up the correct + # model even the ensemble itself can't be inferred without providing + # namespace. + self.check_health() + + # infer + for model in [ + "composing_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "composing_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("simple_ensemble") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + def test_dynamic_resolution(self): + # Same model setup as 'test_duplication', will remove / add one of the + # composing model at runtime and expect the ensemble to be properly + # linked to existing composing model at different steps. + # 1. Remove 'composing_model' in addsub_repo, expect both ensembles use + # 'composing_model' in subadd_repo and act as subadd + # 2. Add back 'composing_model' in addsub_repo, expect the ensembles to behave the + # same as before the removal. + self.assertTrue("NAMESPACE_TESTING_DIRCTORY" in os.environ) + td = os.environ["NAMESPACE_TESTING_DIRCTORY"] + composing_before_path = os.path.join(td, "addsub_repo", "composing_model") + composing_after_path = os.path.join(td, "composing_model") + + self.check_health() + # step 1. 
+ shutil.move(composing_before_path, composing_after_path) + time.sleep(5) + + # infer + for model in ["simple_subadd", "simple_addsub", "composing_model"]: + self.subadd_.infer(model) + + # step 2. + shutil.move(composing_after_path, composing_before_path) + time.sleep(5) + + # infer + for model in [ + "simple_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "simple_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("composing_model") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + +class ModelNamespaceExplicit(tu.TestResultCollector): + def setUp(self): + self.addsub_ = AddSubChecker() + self.subadd_ = SubAddChecker() + # For other server interaction + self.client_ = httpclient.InferenceServerClient("localhost:8000") + + def check_health(self, expect_live=True, expect_ready=True): + self.assertEqual(self.client_.is_server_live(), expect_live) + self.assertEqual(self.client_.is_server_ready(), expect_ready) + + def test_no_duplication(self): + # Enable model namspacing on repositories that is already valid without + # enabling model namespacing. + # All models should be visible and can be inferred individually + self.check_health() + # load ensembles, cascadingly load composing model + for model in ["simple_addsub", "simple_subadd"]: + self.client_.load_model(model) + + # infer + for model in ["simple_addsub", "composing_addsub"]: + self.addsub_.infer(model) + for model in ["simple_subadd", "composing_subadd"]: + self.subadd_.infer(model) + + def test_duplication(self): + # Enable model namspacing on repositories that each repo has one + # ensemble and it requires an composing model ('composing_model') that + # exists in both repos. + # Expect all models are visible, the ensemble will pick up the correct + # model even the composing model can't be inferred individually. + self.check_health() + # load ensembles, cascadingly load composing model + for model in ["simple_addsub", "simple_subadd"]: + self.client_.load_model(model) + + # infer + for model in [ + "simple_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "simple_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("composing_model") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + def test_ensemble_duplication(self): + # Enable model namspacing on repositories that each repo has one + # ensemble with the same name. Expect the ensemble will pick up the correct + # model. + # Expect all models are visible, the ensemble will pick up the correct + # model even the ensemble itself can't be inferred without providing + # namespace. 
+ self.check_health() + # load ensembles, cascadingly load composing model + for model in ["simple_ensemble"]: + self.client_.load_model(model) + + # infer + for model in [ + "composing_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "composing_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("simple_ensemble") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + def test_dynamic_resolution(self): + # Same model setup as 'test_duplication', will remove / add one of the + # composing model at runtime and expect the ensemble to be properly + # linked to existing composing model at different steps. + # 1. Remove 'composing_model' in addsub_repo, expect both ensembles use + # 'composing_model' in subadd_repo and act as subadd. + # 2. Add back 'composing_model' in addsub_repo, expect the ensembles to behave the + # same as before the removal. + self.assertTrue("NAMESPACE_TESTING_DIRCTORY" in os.environ) + td = os.environ["NAMESPACE_TESTING_DIRCTORY"] + composing_before_path = os.path.join(td, "addsub_repo", "composing_model") + composing_after_path = os.path.join(td, "composing_model") + + self.check_health() + # step 1. + shutil.move(composing_before_path, composing_after_path) + # load ensembles, cascadingly load composing model + for model in ["simple_addsub", "simple_subadd"]: + self.client_.load_model(model) + + # infer + for model in ["simple_subadd", "simple_addsub", "composing_model"]: + self.subadd_.infer(model) + + # step 2. + shutil.move(composing_after_path, composing_before_path) + # Explicitly load one of the ensembel, should still trigger cascading + # (re-)load + for model in [ + "simple_addsub", + ]: + self.client_.load_model(model) + + # infer + for model in [ + "simple_addsub", + ]: + self.addsub_.infer(model) + for model in [ + "simple_subadd", + ]: + self.subadd_.infer(model) + + # error check + try: + self.addsub_.infer("composing_model") + self.assertTrue(False, "expected error for inferring ambiguous named model") + except InferenceServerException as ex: + self.assertIn("ambiguity", ex.message()) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_model_namespacing/test.sh b/qa/L0_model_namespacing/test.sh new file mode 100755 index 0000000000..414bd3dde9 --- /dev/null +++ b/qa/L0_model_namespacing/test.sh @@ -0,0 +1,149 @@ +#!/bin/bash +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +TRITON_QA_ROOT_DIR=${TRITON_QA_ROOT_DIR:="/opt/tritonserver/qa"} +source $TRITON_QA_ROOT_DIR/common/util.sh + +RET=0 + +TEST_PY=./test.py +# tests are run individually +EXPECTED_NUM_TESTS="1" +TEST_RESULT_FILE='test_results.txt' + + +export CUDA_VISIBLE_DEVICES=0 +export TRITON_QA_ROOT_DIR=$TRITON_QA_ROOT_DIR +export TRITON_QA_PYTHON_MODEL_DIR=$TRITON_QA_ROOT_DIR/L0_model_namespacing + +rm -fr *.log + +REPO_ARGS="--model-namespacing=true --model-repository=`pwd`/test_dir/addsub_repo --model-repository=`pwd`/test_dir/subadd_repo" +POLL_ARGS="--model-control-mode=POLL --repository-poll-secs=2" +EXPLICIT_ARGS="--model-control-mode=EXPLICIT" + +SERVER=/opt/tritonserver/bin/tritonserver + +# List all tests as each test will use different repo configuration +TEST_LIST=${TEST_LIST:="test_duplication \ + test_dynamic_resolution \ + test_ensemble_duplication \ + test_no_duplication"} + +# Helper to make sure all ensemble have version directory +CURR_DIR=`pwd` +for test_name in $TEST_LIST; do + for model_dir in $CURR_DIR/$test_name/*/*; do + mkdir -p $model_dir/1 + done +done + +# Set this variable to avoid generation of '__pycache__' in the model directory, +# which will cause unintended model reload in POLLING model as Triton sees +# changes in the model directory +export PYTHONDONTWRITEBYTECODE=1 + +# Polling +for test_name in $TEST_LIST; do + TEST_SUITE="ModelNamespacePoll" + TEST_LOG="`pwd`/test.$TEST_SUITE.$test_name.log" + SERVER_LOG="./server.$TEST_SUITE.$test_name.log" + + rm -fr `pwd`/test_dir + cp -r `pwd`/$test_name `pwd`/test_dir + SERVER_ARGS="$REPO_ARGS $POLL_ARGS" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + # Pass in the test directory as the test may modify the structure + NAMESPACE_TESTING_DIRCTORY=`pwd`/test_dir python $TEST_PY $TEST_SUITE.$test_name >>$TEST_LOG 2>&1 + if [ $? -ne 0 ]; then + RET=1 + cat $TEST_LOG + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $TEST_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID +done + +# Explicit +for test_name in $TEST_LIST; do + TEST_SUITE="ModelNamespaceExplicit" + TEST_LOG="`pwd`/test.$TEST_SUITE.$test_name.log" + SERVER_LOG="./server.$TEST_SUITE.$test_name.log" + + rm -fr `pwd`/test_dir + cp -r `pwd`/$test_name `pwd`/test_dir + SERVER_ARGS="$REPO_ARGS $EXPLICIT_ARGS" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + # Pass in the test directory as the test may modify the structure + NAMESPACE_TESTING_DIRCTORY=`pwd`/test_dir python $TEST_PY $TEST_SUITE.$test_name >>$TEST_LOG 2>&1 + if [ $? -ne 0 ]; then + RET=1 + cat $TEST_LOG + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? 
-ne 0 ]; then + cat $TEST_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID +done + + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_model_namespacing/test_duplication/addsub_repo/composing_model/1/model.py b/qa/L0_model_namespacing/test_duplication/addsub_repo/composing_model/1/model.py new file mode 100644 index 0000000000..13a611e7a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_duplication/addsub_repo/composing_model/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_addsub import * diff --git a/qa/L0_model_namespacing/test_duplication/addsub_repo/simple_addsub/config.pbtxt b/qa/L0_model_namespacing/test_duplication/addsub_repo/simple_addsub/config.pbtxt new file mode 100644 index 0000000000..245e256976 --- /dev/null +++ b/qa/L0_model_namespacing/test_duplication/addsub_repo/simple_addsub/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
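The simple_addsub ensemble config that follows routes the ensemble's INPUT0/INPUT1 into a single composing_model step and maps that step's outputs straight back out. A hedged client-side sketch of how the AddSubChecker in test.py exercises these ensembles over HTTP, and how the composing model duplicated across addsub_repo and subadd_repo surfaces as an ambiguity error when addressed directly (model names and port as used by test.sh):

    import numpy as np
    import tritonclient.http as httpclient
    from tritonclient.utils import InferenceServerException

    client = httpclient.InferenceServerClient("localhost:8000")
    data = np.arange(16, dtype=np.int32)
    inputs = [httpclient.InferInput("INPUT0", [16], "INT32"),
              httpclient.InferInput("INPUT1", [16], "INT32")]
    for inp in inputs:
        inp.set_data_from_numpy(data)

    # Each ensemble resolves "composing_model" within its own namespace ...
    result = client.infer("simple_addsub", inputs)
    assert np.array_equal(result.as_numpy("OUTPUT0"), data + data)
    assert np.array_equal(result.as_numpy("OUTPUT1"), data - data)

    # ... but inferring the duplicated composing model by name is ambiguous.
    try:
        client.infer("composing_model", inputs)
    except InferenceServerException as ex:
        assert "ambiguity" in ex.message()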
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_model" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_duplication/subadd_repo/composing_model/1/model.py b/qa/L0_model_namespacing/test_duplication/subadd_repo/composing_model/1/model.py new file mode 100644 index 0000000000..664c20b58f --- /dev/null +++ b/qa/L0_model_namespacing/test_duplication/subadd_repo/composing_model/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_subadd import * diff --git a/qa/L0_model_namespacing/test_duplication/subadd_repo/simple_subadd/config.pbtxt b/qa/L0_model_namespacing/test_duplication/subadd_repo/simple_subadd/config.pbtxt new file mode 100644 index 0000000000..85d8ec0051 --- /dev/null +++ b/qa/L0_model_namespacing/test_duplication/subadd_repo/simple_subadd/config.pbtxt @@ -0,0 +1,88 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
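In explicit model-control mode (the ModelNamespaceExplicit suite above, run by test.sh with --model-control-mode=EXPLICIT), nothing is served until an ensemble is loaded by name; loading it cascades the load of its composing model from the same namespace. A short sketch of that flow, assuming the same local server as in the previous example:

    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient("localhost:8000")
    client.load_model("simple_addsub")   # also loads addsub_repo's composing_model
    client.load_model("simple_subadd")   # also loads subadd_repo's composing_model
    assert client.is_model_ready("simple_addsub")
    assert client.is_model_ready("simple_subadd")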
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_model" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/composing_model/1/model.py b/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/composing_model/1/model.py new file mode 100644 index 0000000000..13a611e7a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/composing_model/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_addsub import * diff --git a/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/simple_addsub/config.pbtxt b/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/simple_addsub/config.pbtxt new file mode 100644 index 0000000000..245e256976 --- /dev/null +++ b/qa/L0_model_namespacing/test_dynamic_resolution/addsub_repo/simple_addsub/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_model" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/composing_model/1/model.py b/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/composing_model/1/model.py new file mode 100644 index 0000000000..664c20b58f --- /dev/null +++ b/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/composing_model/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_subadd import * diff --git a/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/simple_subadd/config.pbtxt b/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/simple_subadd/config.pbtxt new file mode 100644 index 0000000000..245e256976 --- /dev/null +++ b/qa/L0_model_namespacing/test_dynamic_resolution/subadd_repo/simple_subadd/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
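The test_dynamic_resolution repositories above mirror test_duplication; the difference happens at runtime: test.py moves addsub_repo's composing_model out of the repository so only the subadd copy remains, then moves it back. With the polling options used by test.sh (--model-control-mode=POLL --repository-poll-secs=2), the move is picked up on the next poll and both ensembles temporarily resolve to the remaining subadd composing model. A hedged sketch of that step (paths relative to the test_dir copy created by test.sh):

    import shutil
    import time

    # Step 1: remove the addsub copy; both ensembles now act as subadd.
    shutil.move("test_dir/addsub_repo/composing_model", "test_dir/composing_model")
    time.sleep(5)  # longer than repository-poll-secs, so at least one poll has run

    # Step 2: restore it; behavior returns to the pre-removal state.
    shutil.move("test_dir/composing_model", "test_dir/addsub_repo/composing_model")
    time.sleep(5)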
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_model" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/composing_addsub/1/model.py b/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/composing_addsub/1/model.py new file mode 100644 index 0000000000..13a611e7a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/composing_addsub/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_addsub import * diff --git a/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/simple_ensemble/config.pbtxt b/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/simple_ensemble/config.pbtxt new file mode 100644 index 0000000000..2a9f0003a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_ensemble_duplication/addsub_repo/simple_ensemble/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
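Because the composing models in these repositories ship only a model.py (no config.pbtxt), their input/output configuration comes entirely from auto_complete_config() in python_addsub/python_subadd. Once a model such as composing_addsub is loaded, the generated configuration can be inspected from the client; a small sketch, assuming the same local server:

    import json

    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient("localhost:8000")
    # The HTTP client returns the model configuration as a dict.
    config = client.get_model_config("composing_addsub")
    print(json.dumps(config, indent=2))  # shows the auto-completed INPUT*/OUTPUT* entries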
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_addsub" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/composing_subadd/1/model.py b/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/composing_subadd/1/model.py new file mode 100644 index 0000000000..664c20b58f --- /dev/null +++ b/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/composing_subadd/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_subadd import * diff --git a/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/simple_ensemble/config.pbtxt b/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/simple_ensemble/config.pbtxt new file mode 100644 index 0000000000..0ee1015f25 --- /dev/null +++ b/qa/L0_model_namespacing/test_ensemble_duplication/subadd_repo/simple_ensemble/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_subadd" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_no_duplication/addsub_repo/composing_addsub/1/model.py b/qa/L0_model_namespacing/test_no_duplication/addsub_repo/composing_addsub/1/model.py new file mode 100644 index 0000000000..13a611e7a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_no_duplication/addsub_repo/composing_addsub/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_addsub import * diff --git a/qa/L0_model_namespacing/test_no_duplication/addsub_repo/simple_addsub/config.pbtxt b/qa/L0_model_namespacing/test_no_duplication/addsub_repo/simple_addsub/config.pbtxt new file mode 100644 index 0000000000..2a9f0003a3 --- /dev/null +++ b/qa/L0_model_namespacing/test_no_duplication/addsub_repo/simple_addsub/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_addsub" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_namespacing/test_no_duplication/subadd_repo/composing_subadd/1/model.py b/qa/L0_model_namespacing/test_no_duplication/subadd_repo/composing_subadd/1/model.py new file mode 100644 index 0000000000..664c20b58f --- /dev/null +++ b/qa/L0_model_namespacing/test_no_duplication/subadd_repo/composing_subadd/1/model.py @@ -0,0 +1,6 @@ +import os +import sys + +# load pre-defined QA model +sys.path.append(os.environ["TRITON_QA_PYTHON_MODEL_DIR"]) +from python_subadd import * diff --git a/qa/L0_model_namespacing/test_no_duplication/subadd_repo/simple_subadd/config.pbtxt b/qa/L0_model_namespacing/test_no_duplication/subadd_repo/simple_subadd/config.pbtxt new file mode 100644 index 0000000000..0ee1015f25 --- /dev/null +++ b/qa/L0_model_namespacing/test_no_duplication/subadd_repo/simple_subadd/config.pbtxt @@ -0,0 +1,90 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +platform: "ensemble" +max_batch_size: 0 +version_policy: { all { }} + + + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_INT32 + dims: [ 16 ] + + + } +] +ensemble_scheduling { + step [ + { + model_name: "composing_subadd" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_model_queue/model_queue_test.py b/qa/L0_model_queue/model_queue_test.py old mode 100644 new mode 100755 index 42bbe9130e..e7be471f79 --- a/qa/L0_model_queue/model_queue_test.py +++ b/qa/L0_model_queue/model_queue_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,17 +27,19 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -from builtins import range -import time import threading +import time import unittest -import numpy as np +from builtins import range +from ctypes import * + import infer_util as iu +import numpy as np import test_util as tu from tritonclientutils import InferenceServerException -from ctypes import * _max_queue_delay_ms = 10000 @@ -44,15 +48,11 @@ class ModelQueueTest(tu.TestResultCollector): - def setUp(self): self.trials_ = [] for base in ["custom", "ensemble"]: for is_http_trial in [True, False]: - self.trials_.append({ - "base": base, - "is_http_trial": is_http_trial - }) + self.trials_.append({"base": base, "is_http_trial": is_http_trial}) global _deferred_exceptions _deferred_exceptions = [] @@ -69,33 +69,41 @@ def check_deferred_exception(self): _deferred_exceptions.pop(0) raise first_exception - def check_response(self, - bs, - dtype, - shapes, - priority, - timeout_us, - thresholds, - base="custom", - is_http_trial=True): - full_shapes = [[ - bs, - ] + shape for shape in shapes] + def check_response( + self, + bs, + dtype, + shapes, + priority, + timeout_us, + thresholds, + base="custom", + is_http_trial=True, + ): + full_shapes = [ + [ + bs, + ] + + shape + for shape in shapes + ] try: start_ms = int(round(time.time() * 1000)) - iu.infer_zero(self, - base, - bs, - dtype, - full_shapes, - full_shapes, - model_version=1, - use_http_json_tensors=False, - use_http=is_http_trial, - use_grpc=(not is_http_trial), - use_streaming=False, - priority=priority, - timeout_us=timeout_us) + iu.infer_zero( + self, + base, + bs, + dtype, + full_shapes, + full_shapes, + model_version=1, + use_http_json_tensors=False, + use_http=is_http_trial, + use_grpc=(not is_http_trial), + use_streaming=False, + priority=priority, + timeout_us=timeout_us, + ) end_ms = int(round(time.time() * 1000)) @@ -104,13 +112,21 @@ def check_response(self, if lt_ms is not None: self.assertTrue( (end_ms - start_ms) < lt_ms, - "expected less than " + str(lt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) if 
gt_ms is not None: self.assertTrue( (end_ms - start_ms) > gt_ms, - "expected greater than " + str(gt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) except Exception as ex: self.add_deferred_exception(ex) @@ -124,15 +140,17 @@ def test_max_queue_size(self): for trial in self.trials_: preceding_thread = threading.Thread( target=self.check_response, - args=(8, dtype, shapes, 0, 0, (1999, 1000)), + args=(8, dtype, shapes, 0, 0, (5999, 1000)), ) threads = [] for i in range(10): threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (None, - None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (None, None)), + kwargs=trial, + ) + ) preceding_thread.start() time.sleep(0.5) for t in threads: @@ -142,15 +160,27 @@ def test_max_queue_size(self): for t in threads: t.join() - # Expect at most two exception with exceeding max queue size error - for i in range(2): + # Expect exactly two exception with exceeding max queue size error + expected_exceeded_count = 2 + exceeded_count = 0 + for i in range(expected_exceeded_count): try: self.check_deferred_exception() except InferenceServerException as ex: self.assertTrue( "Exceeds maximum queue size" in ex.message(), - "Expected error message \"Exceeds maximum queue size\", got: {}" - .format(ex)) + 'Expected error message "Exceeds maximum queue size", got: {}'.format( + ex + ), + ) + exceeded_count = exceeded_count + 1 + self.assertEqual( + exceeded_count, + expected_exceeded_count, + "expected {} requests to fail with exceeded max queue size error, got {}".format( + expected_exceeded_count, exceeded_count + ), + ) try: self.check_deferred_exception() except InferenceServerException as ex: @@ -169,18 +199,26 @@ def test_policy_delay(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (15000, - 10000)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (15000, 10000)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads[0].start() time.sleep(0.2) threads[1].start() @@ -202,17 +240,26 @@ def test_policy_reject(self): for trial in self.trials_: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (None, None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) 
threads[0].start() time.sleep(0.2) threads[1].start() @@ -227,8 +274,10 @@ def test_policy_reject(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -237,7 +286,7 @@ def test_policy_reject(self): def test_timeout_override(self): # Send requests with batch sizes 1, 1, 3 where the first request - # overrides the timout to be less than 'default_timeout_microseconds', + # overrides the timeout to be less than 'default_timeout_microseconds', # and the second and third requests are sent after the overridden # timeout. Expect the first request is timed-out and rejected before # 'default_timeout_microseconds', which makes the second and third @@ -249,18 +298,26 @@ def test_timeout_override(self): for trial in self.trials_: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 100000, (None, - None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 100000, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (100, 0)), + kwargs=trial, + ) + ) threads[0].start() time.sleep(0.2) threads[1].start() @@ -275,8 +332,10 @@ def test_timeout_override(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -288,18 +347,26 @@ def test_timeout_override(self): # 'default_timeout_microseconds' and before queue delay. 
threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 10000000, (None, - None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 10000000, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (1100, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (1100, 700)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (1100, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (1100, 700)), + kwargs=trial, + ) + ) threads[0].start() time.sleep(0.2) threads[1].start() @@ -314,8 +381,10 @@ def test_timeout_override(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -326,17 +395,26 @@ def test_timeout_override(self): # processed only after 'default_timeout_microseconds' threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (None, None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (1100, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (1100, 700)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (1100, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (1100, 700)), + kwargs=trial, + ) + ) threads[0].start() time.sleep(0.2) threads[1].start() @@ -351,8 +429,10 @@ def test_timeout_override(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -369,17 +449,72 @@ def test_priority_levels(self): for trial in self.trials_: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 0, 0, (500, 200)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (500, 200)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 0, 0, (15000, 10000)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 0, 0, (15000, 10000)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 1, 0, (100, 0)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 1, 0, (100, 0)), + kwargs=trial, + ) + ) + threads[0].start() + # wait to make sure the order is correct + time.sleep(0.1) + threads[1].start() + time.sleep(0.2) + threads[2].start() + + for t in threads: + t.join() + + try: + self.check_deferred_exception() + except InferenceServerException as ex: + 
self.assertTrue(False, "unexpected error {}".format(ex)) + + def test_max_priority_levels(self): + # Send 2 requests with batch sizes 2, 1 in default priority (MAX_UINT32+1). Then send + # 1 request with batch size 2 in priority 1. Expect the third request is + # place in the front of the queue and form a preferred batch with the + # first request. + dtype = np.float32 + shapes = ([16],) + MAX_UINT32_PLUS_1 = 4294967296 + for trial in self.trials_: + threads = [] + threads.append( + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 0, 0, (500, 200)), + kwargs=trial, + ) + ) + threads.append( + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, MAX_UINT32_PLUS_1, 0, (15000, 10000)), + kwargs=trial, + ) + ) + threads.append( + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 1, 0, (100, 0)), + kwargs=trial, + ) + ) threads[0].start() # wait to make sure the order is correct time.sleep(0.1) @@ -425,31 +560,47 @@ def test_priority_with_policy(self): # The expected ranges may not be rounded to accommodate # the sleep between sending requests threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 1, 0, (2000, 1000)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 1, 0, (2000, 1000)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(1, dtype, shapes, 1, 1000000, (3400, - 2400)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(1, dtype, shapes, 1, 1000000, (3400, 2400)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 1, 0, (1700, 700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 1, 0, (1700, 700)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, dtype, shapes, 2, 2000000, (None, - None)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(2, dtype, shapes, 2, 2000000, (None, None)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(3, dtype, shapes, 2, 0, (2700, 1700)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(3, dtype, shapes, 2, 0, (2700, 1700)), + kwargs=trial, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(6, dtype, shapes, 2, 0, (15000, 10000)), - kwargs=trial)) + threading.Thread( + target=self.check_response, + args=(6, dtype, shapes, 2, 0, (15000, 10000)), + kwargs=trial, + ) + ) for t in threads: t.start() time.sleep(0.2) @@ -463,8 +614,10 @@ def test_priority_with_policy(self): except InferenceServerException as ex: self.assertTrue( "Request timeout expired" in ex.message(), - "Expected error message \"Request timeout expired\", got: {}" - .format(ex)) + 'Expected error message "Request timeout expired", got: {}'.format( + ex + ), + ) try: self.check_deferred_exception() @@ -472,5 +625,5 @@ def test_priority_with_policy(self): self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_model_queue/test.sh b/qa/L0_model_queue/test.sh old mode 100644 new mode 100755 index a995e10687..577b7b7fc2 --- a/qa/L0_model_queue/test.sh +++ b/qa/L0_model_queue/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -57,7 +57,7 @@ RET=0 export CUDA_VISIBLE_DEVICES=0 # Prepare base model. Only test with custom backend as it is sufficient -rm -fr *.log *.serverlog models custom_zero_1_float32 +rm -fr *.log models custom_zero_1_float32 cp -r ../custom_models/custom_zero_1_float32 . && \ mkdir -p ./custom_zero_1_float32/1 && \ mkdir -p ./ensemble_zero_1_float32/1 @@ -82,11 +82,11 @@ rm -fr models && mkdir models && \ echo " }" >> config.pbtxt && \ echo "}" >> config.pbtxt && \ echo "parameters [" >> config.pbtxt && \ - echo "{ key: \"execute_delay_ms\"; value: { string_value: \"1000\" }}" >> config.pbtxt && \ + echo "{ key: \"execute_delay_ms\"; value: { string_value: \"5000\" }}" >> config.pbtxt && \ echo "]" >> config.pbtxt) TEST_CASE=test_max_queue_size -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -129,7 +129,7 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_policy_delay -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -171,7 +171,7 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_policy_reject -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -214,7 +214,7 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_timeout_override -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -255,7 +255,51 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_priority_levels -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +echo "Test: $TEST_CASE" >>$CLIENT_LOG + +set +e +python $MODEL_QUEUE_TEST ModelQueueTest.$TEST_CASE >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +MAX_UINT64=18446744073709551615 +MAX_UINT32_PLUS_1=4294967296 + +# test_max_priority_levels +rm -fr models && mkdir models && \ + cp -r ensemble_zero_1_float32 models/. && \ + cp -r custom_zero_1_float32 models/. 
&& \ + (cd models/custom_zero_1_float32 && \ + echo "dynamic_batching { " >> config.pbtxt && \ + echo " preferred_batch_size: [ 4, 8 ]" >> config.pbtxt && \ + echo " max_queue_delay_microseconds: 10000000" >> config.pbtxt && \ + echo " priority_levels: $MAX_UINT64" >> config.pbtxt && \ + echo " default_priority_level: $MAX_UINT32_PLUS_1" >> config.pbtxt && \ + echo "}" >> config.pbtxt) + +TEST_CASE=test_max_priority_levels +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -312,7 +356,7 @@ rm -fr models && mkdir models && \ echo "}" >> config.pbtxt) TEST_CASE=test_priority_with_policy -SERVER_LOG="./$TEST_CASE.serverlog" +SERVER_LOG="./$TEST_CASE.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_model_update/instance_update_test.py b/qa/L0_model_update/instance_update_test.py new file mode 100755 index 0000000000..a3c9ce3201 --- /dev/null +++ b/qa/L0_model_update/instance_update_test.py @@ -0,0 +1,649 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import concurrent.futures +import json +import os +import random +import time +import unittest + +import numpy as np +import tritonclient.grpc as grpcclient +from models.model_init_del.util import ( + disable_batching, + enable_batching, + get_count, + reset_count, + set_delay, + update_instance_group, + update_model_file, + update_sequence_batching, +) +from tritonclient.utils import InferenceServerException + + +class TestInstanceUpdate(unittest.TestCase): + _model_name = "model_init_del" + + def setUp(self): + # Reset counters + reset_count("initialize") + reset_count("finalize") + # Reset batching + disable_batching() + # Reset delays + set_delay("initialize", 0) + set_delay("infer", 0) + # Reset sequence batching + update_sequence_batching("") + # Initialize client + self._triton = grpcclient.InferenceServerClient("localhost:8001") + + def tearDown(self): + # Check if the test passed for this test case that is tearing down + r = self.defaultTestResult() + self._feedErrorsToResult(r, self._outcome.errors) + # Use `r = self._outcome.result` for the above, if Python >= 3.11 + passed = all(self != test_case for test_case, _ in r.errors + r.failures) + if passed: + # Do nothing if passed + return + # Best effort to reset the model state for the next test case + self._triton.unload_model(self._model_name) + time.sleep(30) # time for instances to finish unloading + + def _get_inputs(self, batching=False): + self.assertIsInstance(batching, bool) + if batching: + shape = [random.randint(1, 2), random.randint(1, 16)] + else: + shape = [random.randint(1, 16)] + inputs = [grpcclient.InferInput("INPUT0", shape, "FP32")] + inputs[0].set_data_from_numpy(np.ones(shape, dtype=np.float32)) + return inputs + + def _infer(self, batching=False): + self._triton.infer(self._model_name, self._get_inputs(batching)) + + def _concurrent_infer(self, concurrency=4, batching=False): + pool = concurrent.futures.ThreadPoolExecutor() + stop = [False] + + def repeat_infer(): + while not stop[0]: + self._infer(batching) + + infer_threads = [pool.submit(repeat_infer) for i in range(concurrency)] + + def stop_infer(): + stop[0] = True + [t.result() for t in infer_threads] + pool.shutdown() + + return stop_infer + + def _check_count(self, kind, expected_count, poll=False): + self.assertIsInstance(poll, bool) + if poll: + timeout = 30 # seconds + poll_interval = 0.1 # seconds + max_retry = timeout / poll_interval + num_retry = 0 + while num_retry < max_retry and get_count(kind) < expected_count: + time.sleep(poll_interval) + num_retry += 1 + self.assertEqual(get_count(kind), expected_count) + + def _load_model(self, instance_count, instance_config="", batching=False): + # Set batching + enable_batching() if batching else disable_batching() + # Load model + self._update_instance_count( + instance_count, 0, instance_config, batching=batching + ) + + def _update_instance_count( + self, + add_count, + del_count, + instance_config="", + wait_for_finalize=False, + batching=False, + ): + self.assertIsInstance(add_count, int) + self.assertGreaterEqual(add_count, 0) + self.assertIsInstance(del_count, int) + self.assertGreaterEqual(del_count, 0) + self.assertIsInstance(instance_config, str) + prev_initialize_count = get_count("initialize") + prev_finalize_count = get_count("finalize") + new_initialize_count = prev_initialize_count + add_count + new_finalize_count = prev_finalize_count + del_count + if len(instance_config) == 0: + prev_count = prev_initialize_count - prev_finalize_count + new_count = prev_count + add_count - 
del_count + instance_config = "{\ncount: " + str(new_count) + "\nkind: KIND_CPU\n}" + update_instance_group(instance_config) + self._triton.load_model(self._model_name) + self._check_count("initialize", new_initialize_count) + self._check_count("finalize", new_finalize_count, wait_for_finalize) + self._infer(batching) + + def _unload_model(self, batching=False): + prev_initialize_count = get_count("initialize") + self._triton.unload_model(self._model_name) + self._check_count("initialize", prev_initialize_count) + self._check_count("finalize", prev_initialize_count, True) + with self.assertRaises(InferenceServerException): + self._infer(batching) + + # Test add -> remove -> add an instance without batching + def test_add_rm_add_instance_no_batching(self): + self._load_model(3, batching=False) + stop = self._concurrent_infer(batching=False) + self._update_instance_count(1, 0, batching=False) # add + self._update_instance_count(0, 1, batching=False) # remove + self._update_instance_count(1, 0, batching=False) # add + stop() + self._unload_model(batching=False) + + # Test add -> remove -> add an instance with batching + def test_add_rm_add_instance_with_batching(self): + self._load_model(4, batching=True) + stop = self._concurrent_infer(batching=True) + self._update_instance_count(1, 0, batching=True) # add + self._update_instance_count(0, 1, batching=True) # remove + self._update_instance_count(1, 0, batching=True) # add + stop() + self._unload_model(batching=True) + + # Test remove -> add -> remove an instance without batching + def test_rm_add_rm_instance_no_batching(self): + self._load_model(2, batching=False) + stop = self._concurrent_infer(batching=False) + self._update_instance_count(0, 1, batching=False) # remove + self._update_instance_count(1, 0, batching=False) # add + self._update_instance_count(0, 1, batching=False) # remove + stop() + self._unload_model(batching=False) + + # Test remove -> add -> remove an instance with batching + def test_rm_add_rm_instance_with_batching(self): + self._load_model(3, batching=True) + stop = self._concurrent_infer(batching=True) + self._update_instance_count(0, 1, batching=True) # remove + self._update_instance_count(1, 0, batching=True) # add + self._update_instance_count(0, 1, batching=True) # remove + stop() + self._unload_model(batching=True) + + # Test reduce instance count to zero + def test_rm_instance_to_zero(self): + self._load_model(1) + # Setting instance group count to 0 will be overwritten to 1, so no + # instances should be created or removed. 
+ self._update_instance_count(0, 0, "{\ncount: 0\nkind: KIND_CPU\n}") + self._unload_model() + + # Test add/remove multiple CPU instances at a time + def test_cpu_instance_update(self): + self._load_model(8) + self._update_instance_count(0, 4) # remove 4 instances + self._update_instance_count(0, 3) # remove 3 instances + self._update_instance_count(0, 0) # no change + time.sleep(0.1) # larger the gap for config.pbtxt timestamp to update + self._update_instance_count(2, 0) # add 2 instances + self._update_instance_count(5, 0) # add 5 instances + self._unload_model() + + # Test add/remove multiple GPU instances at a time + def test_gpu_instance_update(self): + self._load_model(6, "{\ncount: 6\nkind: KIND_GPU\n}") + self._update_instance_count(0, 2, "{\ncount: 4\nkind: KIND_GPU\n}") + self._update_instance_count(3, 0, "{\ncount: 7\nkind: KIND_GPU\n}") + self._unload_model() + + # Test add/remove multiple CPU/GPU instances at a time + def test_gpu_cpu_instance_update(self): + # Load model with 1 GPU instance and 2 CPU instance + self._load_model( + 3, "{\ncount: 2\nkind: KIND_CPU\n},\n{\ncount: 1\nkind: KIND_GPU\n}" + ) + # Add 2 GPU instance and remove 1 CPU instance + self._update_instance_count( + 2, 1, "{\ncount: 1\nkind: KIND_CPU\n},\n{\ncount: 3\nkind: KIND_GPU\n}" + ) + # Shuffle the instances + self._update_instance_count( + 0, 0, "{\ncount: 3\nkind: KIND_GPU\n},\n{\ncount: 1\nkind: KIND_CPU\n}" + ) + time.sleep(0.1) # larger the gap for config.pbtxt timestamp to update + # Remove 1 GPU instance and add 1 CPU instance + self._update_instance_count( + 1, 1, "{\ncount: 2\nkind: KIND_GPU\n},\n{\ncount: 2\nkind: KIND_CPU\n}" + ) + # Unload model + self._unload_model() + + # Test model instance name update + def test_instance_name_update(self): + # Load 3 instances with 2 different names + self._load_model( + 3, + '{\nname: "old_1"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "old_2"\ncount: 2\nkind: KIND_GPU\n}', + ) + # Change the instance names + self._update_instance_count( + 0, + 0, + '{\nname: "new_1"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "new_2"\ncount: 2\nkind: KIND_GPU\n}', + ) + # Unload model + self._unload_model() + + # Test instance signature grouping + def test_instance_signature(self): + # Load 2 GPU instances and 3 CPU instances + self._load_model( + 5, + '{\nname: "GPU_group"\ncount: 2\nkind: KIND_GPU\n},\n{\nname: "CPU_group"\ncount: 3\nkind: KIND_CPU\n}', + ) + # Flatten the instances representation + self._update_instance_count( + 0, + 0, + '{\nname: "CPU_1"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "CPU_2_3"\ncount: 2\nkind: KIND_CPU\n},\n{\nname: "GPU_1"\ncount: 1\nkind: KIND_GPU\n},\n{\nname: "GPU_2"\ncount: 1\nkind: KIND_GPU\n}', + ) + time.sleep(0.1) # larger the gap for config.pbtxt timestamp to update + # Consolidate different representations + self._update_instance_count( + 0, + 0, + '{\nname: "CPU_group"\ncount: 3\nkind: KIND_CPU\n},\n{\nname: "GPU_group"\ncount: 2\nkind: KIND_GPU\n}', + ) + time.sleep(0.1) # larger the gap for config.pbtxt timestamp to update + # Flatten the instances representation + self._update_instance_count( + 0, + 0, + '{\nname: "GPU_1"\ncount: 1\nkind: KIND_GPU\n},\n{\nname: "GPU_2"\ncount: 1\nkind: KIND_GPU\n},\n{\nname: "CPU_1"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "CPU_2"\ncount: 1\nkind: KIND_CPU\n},\n{\nname: "CPU_3"\ncount: 1\nkind: KIND_CPU\n}', + ) + # Unload model + self._unload_model() + + # Test instance update with invalid instance group config + def test_invalid_config(self): + # Load model with 8 instances + 
self._load_model(8) + # Set invalid config + update_instance_group("--- invalid config ---") + with self.assertRaises(InferenceServerException): + self._triton.load_model("model_init_del") + # Correct config by reducing instances to 4 + self._update_instance_count(0, 4) + # Unload model + self._unload_model() + + # Test instance update with model file changed + def test_model_file_update(self): + self._load_model(5) + update_model_file() + self._update_instance_count( + 6, 5, "{\ncount: 6\nkind: KIND_CPU\n}", wait_for_finalize=True + ) + self._unload_model() + + # Test instance update with non instance config changed in config.pbtxt + def test_non_instance_config_update(self): + self._load_model(4, batching=False) + enable_batching() + self._update_instance_count( + 2, + 4, + "{\ncount: 2\nkind: KIND_CPU\n}", + wait_for_finalize=True, + batching=True, + ) + self._unload_model(batching=True) + + # Test passing new instance config via load API + def test_load_api_with_config(self): + # Load model with 1 instance + self._load_model(1) + # Get the model config from Triton + config = self._triton.get_model_config(self._model_name, as_json=True) + self.assertIn("config", config) + self.assertIsInstance(config["config"], dict) + config = config["config"] + self.assertIn("instance_group", config) + self.assertIsInstance(config["instance_group"], list) + self.assertEqual(len(config["instance_group"]), 1) + self.assertIn("count", config["instance_group"][0]) + self.assertIsInstance(config["instance_group"][0]["count"], int) + # Add an extra instance into the model config + config["instance_group"][0]["count"] += 1 + self.assertEqual(config["instance_group"][0]["count"], 2) + # Load the extra instance via the load API + self._triton.load_model(self._model_name, config=json.dumps(config)) + self._check_count("initialize", 2) # 2 instances in total + self._check_count("finalize", 0) # no instance is removed + self._infer() + # Unload model + self._unload_model() + + # Test instance update with an ongoing inference + def test_update_while_inferencing(self): + # Load model with 1 instance + self._load_model(1) + # Add 1 instance while inferencing + set_delay("infer", 10) + update_instance_group("{\ncount: 2\nkind: KIND_CPU\n}") + with concurrent.futures.ThreadPoolExecutor() as pool: + infer_start_time = time.time() + infer_thread = pool.submit(self._infer) + time.sleep(2) # make sure inference has started + update_start_time = time.time() + update_thread = pool.submit(self._triton.load_model, self._model_name) + update_thread.result() + update_end_time = time.time() + infer_thread.result() + infer_end_time = time.time() + infer_time = infer_end_time - infer_start_time + update_time = update_end_time - update_start_time + # Adding a new instance does not depend on existing instances, so the + # ongoing inference should not block the update. 
+ self.assertGreaterEqual(infer_time, 10.0, "Invalid infer time") + self.assertLess(update_time, 5.0, "Update blocked by infer") + self._check_count("initialize", 2) + self._check_count("finalize", 0) + self._infer() + # Unload model + self._unload_model() + + # Test inference with an ongoing instance update + def test_infer_while_updating(self): + # Load model with 1 instance + self._load_model(1) + # Infer while adding 1 instance + set_delay("initialize", 10) + update_instance_group("{\ncount: 2\nkind: KIND_CPU\n}") + with concurrent.futures.ThreadPoolExecutor() as pool: + update_start_time = time.time() + update_thread = pool.submit(self._triton.load_model, self._model_name) + time.sleep(2) # make sure update has started + infer_start_time = time.time() + infer_thread = pool.submit(self._infer) + infer_thread.result() + infer_end_time = time.time() + update_thread.result() + update_end_time = time.time() + update_time = update_end_time - update_start_time + infer_time = infer_end_time - infer_start_time + # Waiting on new instance creation should not block inference on + # existing instances. + self.assertGreaterEqual(update_time, 10.0, "Invalid update time") + self.assertLess(infer_time, 5.0, "Infer blocked by update") + self._check_count("initialize", 2) + self._check_count("finalize", 0) + self._infer() + # Unload model + self._unload_model() + + # Test instance resource requirement increase + @unittest.skipUnless( + "execution_count" in os.environ["RATE_LIMIT_MODE"], + "Rate limiter precondition not met for this test", + ) + def test_instance_resource_increase(self): + # Load model + self._load_model( + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 2\n}\n]\n}\n}', + ) + # Increase resource requirement + self._update_instance_count( + 1, + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 8\n}\n]\n}\n}', + ) + # Check the model is not blocked from infer due to the default resource + # possibly not updated to the larger resource requirement. 
+ infer_count = 8 + infer_complete = [False for i in range(infer_count)] + + def infer(): + for i in range(infer_count): + self._infer() + infer_complete[i] = True + + with concurrent.futures.ThreadPoolExecutor() as pool: + infer_thread = pool.submit(infer) + time.sleep(infer_count / 2) # each infer should take < 0.5 seconds + self.assertNotIn(False, infer_complete, "Infer possibly stuck") + infer_thread.result() + # Unload model + self._unload_model() + + # Test instance resource requirement increase above explicit resource + @unittest.skipUnless( + os.environ["RATE_LIMIT_MODE"] == "execution_count_with_explicit_resource", + "Rate limiter precondition not met for this test", + ) + def test_instance_resource_increase_above_explicit(self): + # Load model + self._load_model( + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 2\n}\n]\n}\n}', + ) + # Increase resource requirement + with self.assertRaises(InferenceServerException): + self._update_instance_count( + 0, + 0, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 32\n}\n]\n}\n}', + ) + # Correct the resource requirement to match the explicit resource + self._update_instance_count( + 1, + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 10\n}\n]\n}\n}', + ) + # Unload model + self._unload_model() + + # Test instance resource requirement decrease + @unittest.skipUnless( + "execution_count" in os.environ["RATE_LIMIT_MODE"], + "Rate limiter precondition not met for this test", + ) + def test_instance_resource_decrease(self): + # Load model + self._load_model( + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 4\n}\n]\n}\n}', + ) + # Decrease resource requirement + self._update_instance_count( + 1, + 1, + '{\ncount: 1\nkind: KIND_CPU\nrate_limiter {\nresources [\n{\nname: "R1"\ncount: 3\n}\n]\n}\n}', + ) + # Unload model + self._unload_model() + # The resource count of 3 is unique across this entire test, so check + # the server output to make sure it is printed, which ensures the + # max resource is actually decreased. + time.sleep(1) # make sure the log file is updated + log_path = os.path.join( + os.environ["MODEL_LOG_DIR"], + "instance_update_test.rate_limit_" + + os.environ["RATE_LIMIT_MODE"] + + ".server.log", + ) + with open(log_path, mode="r", encoding="utf-8", errors="strict") as f: + if os.environ["RATE_LIMIT_MODE"] == "execution_count": + # Make sure the previous max resource limit of 4 is reduced to 3 + # when no explicit limit is set. + self.assertIn("Resource: R1\t Count: 3", f.read()) + else: + # Make sure the max resource limit is never set to 3 when + # explicit limit of 10 is set. 
+ self.assertNotIn("Resource: R1\t Count: 3", f.read()) + + _direct_sequence_batching_str = ( + "direct { }\nmax_sequence_idle_microseconds: 8000000" + ) + _oldest_sequence_batching_str = ( + "oldest { max_candidate_sequences: 4 }\nmax_sequence_idle_microseconds: 8000000" + ) + + # Test instance update for direct scheduler without any ongoing sequences + def test_direct_scheduler_update_no_ongoing_sequences(self): + self._test_scheduler_update_no_ongoing_sequences( + self._direct_sequence_batching_str + ) + + # Test instance update for direct scheduler with any ongoing sequences + def test_direct_scheduler_update_with_ongoing_sequences(self): + self._test_scheduler_update_with_ongoing_sequences( + self._direct_sequence_batching_str + ) + + # Test instance update for oldest scheduler without ongoing sequences + def test_oldest_scheduler_update_no_ongoing_sequences(self): + self._test_scheduler_update_no_ongoing_sequences( + self._oldest_sequence_batching_str + ) + + # Test instance update for oldest scheduler with ongoing sequences + def test_oldest_scheduler_update_with_ongoing_sequences(self): + self._test_scheduler_update_with_ongoing_sequences( + self._oldest_sequence_batching_str + ) + + # Helper function for testing the success of sequence instance updates + # without any ongoing sequences. + def _test_scheduler_update_no_ongoing_sequences(self, sequence_batching_str): + # Load model + update_instance_group("{\ncount: 2\nkind: KIND_CPU\n}") + update_sequence_batching(sequence_batching_str) + self._triton.load_model(self._model_name) + self._check_count("initialize", 2) + self._check_count("finalize", 0) + # Basic sequence inference + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_start=True + ) + self._triton.infer(self._model_name, self._get_inputs(), sequence_id=1) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_end=True + ) + # Add 2 instances without in-flight sequence + update_instance_group("{\ncount: 4\nkind: KIND_CPU\n}") + self._triton.load_model(self._model_name) + self._check_count("initialize", 4) + self._check_count("finalize", 0) + # Basic sequence inference + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_start=True + ) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_end=True + ) + # Remove 1 instance without in-flight sequence + update_instance_group("{\ncount: 3\nkind: KIND_CPU\n}") + self._triton.load_model(self._model_name) + self._check_count("initialize", 4) + self._check_count("finalize", 1, poll=True) + # Basic sequence inference + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_start=True + ) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_end=True + ) + # Unload model + self._triton.unload_model(self._model_name) + self._check_count("initialize", 4) + self._check_count("finalize", 4, poll=True) + + # Helper function for testing if ongoing sequences may continue to infer on + # the same instance after the instance processing the sequence is removed + # from an instance update, which the removed instance will live until the + # sequences end. 
+ def _test_scheduler_update_with_ongoing_sequences(self, sequence_batching_str): + # Load model + update_instance_group("{\ncount: 3\nkind: KIND_CPU\n}") + update_sequence_batching(sequence_batching_str) + self._triton.load_model(self._model_name) + self._check_count("initialize", 3) + self._check_count("finalize", 0) + # Start sequence 1 and 2 on CPU instances + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_start=True + ) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=2, sequence_start=True + ) + # Remove all 3 CPU and add 1 GPU instance with in-flight sequences. Both + # in-flight sequences are assigned to any 2 CPU instances, so exactly 1 + # CPU instance can be removed immediately. + update_instance_group("{\ncount: 1\nkind: KIND_GPU\n}") + self._triton.load_model(self._model_name) + self._check_count("initialize", 4) # 3 CPU + 1 GPU + self._check_count("finalize", 1, poll=True) # 1 CPU + # Sequence 1 and 2 may continue to infer + self._triton.infer(self._model_name, self._get_inputs(), sequence_id=1) + self._triton.infer(self._model_name, self._get_inputs(), sequence_id=2) + self._check_count("finalize", 1) # check 2 CPU instances not removed + # Start sequence 3 on GPU instance + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=3, sequence_start=True + ) + self._check_count("finalize", 1) # check 2 CPU instances not removed + # End sequence 1 and 2 will remove the 2 CPU instances + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=1, sequence_end=True + ) + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=2, sequence_end=True + ) + self._check_count("finalize", 3, poll=True) # 3 CPU + # End sequence 3 + self._triton.infer( + self._model_name, self._get_inputs(), sequence_id=3, sequence_end=True + ) + # Unload model + self._triton.unload_model(self._model_name) + self._check_count("initialize", 4) # 3 CPU + 1 GPU + self._check_count("finalize", 4, poll=True) # 3 CPU + 1 GPU + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_model_update/test.sh b/qa/L0_model_update/test.sh new file mode 100755 index 0000000000..aa9cf7fcc1 --- /dev/null +++ b/qa/L0_model_update/test.sh @@ -0,0 +1,111 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +# This L0_model_update test should make changes to models without restarting the +# server, unless restarting the server is the only way of accomplishing the +# change. + +export CUDA_VISIBLE_DEVICES=0 +export PYTHONDONTWRITEBYTECODE="True" +export MODEL_LOG_DIR="`pwd`" + +SERVER=/opt/tritonserver/bin/tritonserver +source ../common/util.sh + +function setup_models() { + rm -rf models && mkdir models + # Basic model that log instance creation and destruction + cp -r ../python_models/model_init_del models/model_init_del && \ + mkdir models/model_init_del/1 && \ + mv models/model_init_del/model.py models/model_init_del/1 +} + +RET=0 + +# Test model instance update with rate limiting on/off and explicit resource +for RATE_LIMIT_MODE in "off" "execution_count" "execution_count_with_explicit_resource"; do + + RATE_LIMIT_ARGS="--rate-limit=$RATE_LIMIT_MODE" + if [ "$RATE_LIMIT_MODE" == "execution_count_with_explicit_resource" ]; then + RATE_LIMIT_ARGS="--rate-limit=execution_count --rate-limit-resource=R1:10" + fi + + export RATE_LIMIT_MODE=$RATE_LIMIT_MODE + TEST_LOG="instance_update_test.rate_limit_$RATE_LIMIT_MODE.log" + SERVER_LOG="./instance_update_test.rate_limit_$RATE_LIMIT_MODE.server.log" + + setup_models + SERVER_ARGS="--model-repository=models --model-control-mode=explicit $RATE_LIMIT_ARGS --log-verbose=2" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + python instance_update_test.py > $TEST_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed model instance update test on rate limit mode $RATE_LIMIT_MODE\n***" + cat $TEST_LOG + RET=1 + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID + + set +e + grep "Should not print this" $SERVER_LOG + if [ $? -eq 0 ]; then + echo -e "\n***\n*** Found \"Should not print this\" on \"$SERVER_LOG\"\n***" + cat $SERVER_LOG + RET=1 + fi + set -e + +done + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi +exit $RET diff --git a/qa/L0_multi_server/test.sh b/qa/L0_multi_server/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_nan_inf/models/nan_inf_output/1/model.py b/qa/L0_nan_inf/models/nan_inf_output/1/model.py index de610c6d3c..17cfb04fa0 100644 --- a/qa/L0_nan_inf/models/nan_inf_output/1/model.py +++ b/qa/L0_nan_inf/models/nan_inf_output/1/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,20 +25,20 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import json + import numpy as np import triton_python_backend_utils as pb_utils -class TritonPythonModel: +class TritonPythonModel: def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = json.loads(args["model_config"]) def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" responses = [] - for request in requests: + for _ in requests: # Include one of each specially parsed JSON value: nan, inf, and -inf out_0 = np.array([np.nan, np.inf, np.NINF, 1, 2, 3], dtype=np.float32) out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0) diff --git a/qa/L0_nan_inf/nan_inf_test.py b/qa/L0_nan_inf/nan_inf_test.py old mode 100644 new mode 100755 index e68bc664be..3013b03850 --- a/qa/L0_nan_inf/nan_inf_test.py +++ b/qa/L0_nan_inf/nan_inf_test.py @@ -26,45 +26,55 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys -sys.path.append('../common') + +sys.path.append("../common") import json -import unittest import traceback +import unittest -import requests import numpy as np -import tritonclient.http as tritonhttpclient +import requests +import test_util as tu import tritonclient.grpc as tritongrpcclient +import tritonclient.http as tritonhttpclient from tritonclient.utils import InferenceServerException -import test_util as tu + class NanInfTest(tu.TestResultCollector): expected_output = np.array([np.nan, np.inf, np.NINF, 1, 2, 3], dtype=np.float32) model_name = "nan_inf_output" def test_http_raw(self): - payload = {"inputs": [{"name": "INPUT0", "datatype": "FP32", "shape":[1], "data": [1]}]} - response = requests.post("http://localhost:8000/v2/models/nan_inf_output/infer", - data=json.dumps(payload)) + payload = { + "inputs": [ + {"name": "INPUT0", "datatype": "FP32", "shape": [1], "data": [1]} + ] + } + response = requests.post( + "http://localhost:8000/v2/models/nan_inf_output/infer", + data=json.dumps(payload), + ) if not response.ok: self.assertTrue(False, "Response not OK: {}".format(response.text)) try: print(response.json()) except: - self.assertTrue(False, "Response was not valid JSON:\n{}".format(response.text)) + self.assertTrue( + False, "Response was not valid JSON:\n{}".format(response.text) + ) def test_http(self): triton_client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT0', [1], "FP32")) + inputs.append(tritonhttpclient.InferInput("INPUT0", [1], "FP32")) self.infer_helper(triton_client, inputs) def test_grpc(self): triton_client = tritongrpcclient.InferenceServerClient("localhost:8001") inputs = [] - inputs.append(tritongrpcclient.InferInput('INPUT0', [1], "FP32")) + inputs.append(tritongrpcclient.InferInput("INPUT0", [1], "FP32")) self.infer_helper(triton_client, inputs) def infer_helper(self, triton_client, inputs): @@ -72,16 +82,20 @@ def infer_helper(self, triton_client, inputs): try: results = triton_client.infer(model_name=self.model_name, inputs=inputs) - output0_data = results.as_numpy('OUTPUT0') + output0_data = results.as_numpy("OUTPUT0") # Verify output is as expected # Make sure nan's are equivalent when compared - output_correct = np.array_equal(output0_data, self.expected_output, equal_nan=True) - 
self.assertTrue(output_correct, - "didn't get expected output0: {}".format(output0_data)) + output_correct = np.array_equal( + output0_data, self.expected_output, equal_nan=True + ) + self.assertTrue( + output_correct, "didn't get expected output0: {}".format(output0_data) + ) except InferenceServerException as ex: self.assertTrue(False, ex.message()) except: self.assertTrue(False, traceback.format_exc()) -if __name__ == '__main__': + +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_nullchar_string/nullchar_string_client.py b/qa/L0_nullchar_string/nullchar_string_client.py old mode 100644 new mode 100755 index d90304856d..2d69b41b3d --- a/qa/L0_nullchar_string/nullchar_string_client.py +++ b/qa/L0_nullchar_string/nullchar_string_client.py @@ -26,47 +26,51 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np +import numpy as np import tritongrpcclient as grpcclient import tritonhttpclient as httpclient from tritonclientutils import np_to_triton_dtype FLAGS = None -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-m', - '--model-name', - type=str, - required=True, - help='Name of model') - parser.add_argument('-u', - '--url', - type=str, - required=False, - default='localhost:8000', - help='Inference server URL. Default is localhost:8000.') parser.add_argument( - '-i', - '--protocol', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-m", "--model-name", type=str, required=True, help="Name of model" + ) + parser.add_argument( + "-u", + "--url", + type=str, + required=False, + default="localhost:8000", + help="Inference server URL. Default is localhost:8000.", + ) + parser.add_argument( + "-i", + "--protocol", type=str, required=False, - default='http', - help='Protocol ("http"/"grpc") used to ' + - 'communicate with inference service. Default is "http".') + default="http", + help='Protocol ("http"/"grpc") used to ' + + 'communicate with inference service. Default is "http".', + ) FLAGS = parser.parse_args() if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"): - print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format( - FLAGS.protocol)) + print( + 'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol) + ) exit(1) client_util = httpclient if FLAGS.protocol == "http" else grpcclient @@ -86,8 +90,9 @@ # Send inference request to the inference server. Get results for # output tensor. inputs = [ - client_util.InferInput("INPUT0", input0_data.shape, - np_to_triton_dtype(np.object_)) + client_util.InferInput( + "INPUT0", input0_data.shape, np_to_triton_dtype(np.object_) + ) ] inputs[0].set_data_from_numpy(input0_data) @@ -95,7 +100,7 @@ # We expect there to be 1 result (with batch-size 1). Compare the input # and output tensor calculated by the model. They must be the same. 
- output0_data = results.as_numpy('OUTPUT0') + output0_data = results.as_numpy("OUTPUT0") print(input0_data, "?=?", output0_data) assert np.equal(input0_data.astype(np.bytes_), output0_data).all() diff --git a/qa/L0_nullchar_string/test.sh b/qa/L0_nullchar_string/test.sh old mode 100644 new mode 100755 index f1c81c9aa6..bded41dc92 --- a/qa/L0_nullchar_string/test.sh +++ b/qa/L0_nullchar_string/test.sh @@ -40,16 +40,22 @@ fi export CUDA_VISIBLE_DEVICES=0 +CLIENT_LOG="./client.log" DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_identity_model_repository +MODELS="graphdef_nobatch_zero_1_object savedmodel_nobatch_zero_1_object" NULLCHAR_CLIENT_PY=nullchar_string_client.py -CLIENT_LOG="./client.log" SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=$DATADIR" +SERVER_ARGS="--model-repository=models" SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -f $CLIENT_LOG $SERVER_LOG +rm -f $CLIENT_LOG $SERVER_LOG models + +mkdir -p models +for MODEL in $MODELS; do + cp -r $DATADIR/$MODEL models/. +done run_server if [ "$SERVER_PID" == "0" ]; then @@ -65,7 +71,7 @@ set +e # Ignore ONNX backend because even though ONNX supports string data type, # strings that contain null character in the middle is not allowed. # https://github.com/microsoft/onnxruntime/issues/2284 -for MODEL in graphdef_nobatch_zero_1_object savedmodel_nobatch_zero_1_object; do +for MODEL in $MODELS; do python $NULLCHAR_CLIENT_PY -m $MODEL -v >>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then RET=1 diff --git a/qa/L0_onnx_optimization/test.sh b/qa/L0_onnx_optimization/test.sh index 7190f31515..b574f5db32 100755 --- a/qa/L0_onnx_optimization/test.sh +++ b/qa/L0_onnx_optimization/test.sh @@ -61,8 +61,11 @@ for MODEL in \ models/${MODEL}_test && \ rm -fr models/${MODEL}_test/2 && \ rm -fr models/${MODEL}_test/3 && \ + # Set instance count > 1 to test parallel instance loading across all EPs + INSTANCE_COUNT=5 (cd models/${MODEL}_test && \ - sed -i 's/_float32_float32_float32/&_test/' config.pbtxt) && \ + sed -i 's/_float32_float32_float32/&_test/' config.pbtxt && \ + echo -e "\ninstance_group { count: ${INSTANCE_COUNT} }" >> config.pbtxt) && \ # CUDA EP optimization params cp -r models/${MODEL}_test models/${MODEL}_cuda_config && \ (cd models/${MODEL}_cuda_config && \ diff --git a/qa/L0_optional_input/models/ensemble_identity_2_float32/config.pbtxt b/qa/L0_optional_input/models/ensemble_identity_2_float32/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_optional_input/models/identity_2_float32/config.pbtxt b/qa/L0_optional_input/models/identity_2_float32/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_optional_input/models/optional_connecting_tensor/config.pbtxt b/qa/L0_optional_input/models/optional_connecting_tensor/config.pbtxt new file mode 100644 index 0000000000..afc4ebc00f --- /dev/null +++ b/qa/L0_optional_input/models/optional_connecting_tensor/config.pbtxt @@ -0,0 +1,98 @@ +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +platform: "ensemble" +max_batch_size: 4 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +ensemble_scheduling { + step [ + { + model_name: "optional_identity" + model_version: -1 + input_map { + key: "INPUT0" + value: "INPUT0" + } + input_map { + key: "INPUT1" + value: "INPUT1" + } + output_map { + key: "OUTPUT0" + value: "internal_output0" + } + output_map { + key: "OUTPUT1" + value: "internal_output1" + } + }, + { + model_name: "optional_identity" + model_version: -1 + input_map { + key: "INPUT0" + value: "internal_output0" + } + input_map { + key: "INPUT1" + value: "internal_output1" + } + output_map { + key: "OUTPUT0" + value: "OUTPUT0" + } + output_map { + key: "OUTPUT1" + value: "OUTPUT1" + } + } + ] +} diff --git a/qa/L0_optional_input/models/optional_identity/1/model.py b/qa/L0_optional_input/models/optional_identity/1/model.py new file mode 100644 index 0000000000..c736ecc3bd --- /dev/null +++ b/qa/L0_optional_input/models/optional_identity/1/model.py @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + """ + Identity model in Python backend. + """ + responses = [] + for request in requests: + for tidx in ("0", "1"): + input_tensor = pb_utils.get_input_tensor_by_name( + request, "INPUT" + tidx + ) + if input_tensor is not None: + out_tensor = pb_utils.Tensor( + "OUTPUT" + tidx, input_tensor.as_numpy() + ) + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/qa/L0_optional_input/models/optional_identity/config.pbtxt b/qa/L0_optional_input/models/optional_identity/config.pbtxt new file mode 100644 index 0000000000..0c73fd7ca5 --- /dev/null +++ b/qa/L0_optional_input/models/optional_identity/config.pbtxt @@ -0,0 +1,53 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+backend: "python" +max_batch_size: 4 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + }, + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + optional: true + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + }, + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] diff --git a/qa/L0_optional_input/models/pipeline_identity_2_float32/config.pbtxt b/qa/L0_optional_input/models/pipeline_identity_2_float32/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_optional_input/optional_input_test.py b/qa/L0_optional_input/optional_input_test.py old mode 100644 new mode 100755 index 5143718775..c1fd114d6b --- a/qa/L0_optional_input/optional_input_test.py +++ b/qa/L0_optional_input/optional_input_test.py @@ -1,6 +1,6 @@ #!/usr/bin/python -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -27,16 +27,17 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -import numpy as np import sys -import time import threading +import time import unittest -import tritonclient.grpc as grpcclient -from tritonclient.utils import np_to_triton_dtype + +import numpy as np import test_util as tu +import tritonclient.grpc as grpcclient _deferred_exceptions_lock = threading.Lock() _deferred_exceptions = [] @@ -44,31 +45,30 @@ # Similar set up as dynamic batcher tests class OptionalInputTest(tu.TestResultCollector): - def setUp(self): global _deferred_exceptions _deferred_exceptions = [] # The helper client for setup will be GRPC for simplicity. 
self.triton_client_ = grpcclient.InferenceServerClient("localhost:8001") - self.model_name_ = 'identity_2_float32' + self.model_name_ = "identity_2_float32" # This will not be changed even when ensemble is under test, # as the dynamic batching is performed within the composing model - self.check_status_model = 'identity_2_float32' + self.check_status_model = "identity_2_float32" self.tensor_shape_ = (1, 1) self.inputs_ = { - "INPUT0": grpcclient.InferInput('INPUT0', [1, 1], "FP32"), - "INPUT1": grpcclient.InferInput('INPUT1', [1, 1], "FP32") + "INPUT0": grpcclient.InferInput("INPUT0", [1, 1], "FP32"), + "INPUT1": grpcclient.InferInput("INPUT1", [1, 1], "FP32"), } self.input_data_ = { "INPUT0": np.ones(shape=(1, 1), dtype=np.float32), - "INPUT1": np.zeros(shape=(1, 1), dtype=np.float32) + "INPUT1": np.zeros(shape=(1, 1), dtype=np.float32), } self.inputs_["INPUT0"].set_data_from_numpy(self.input_data_["INPUT0"]) self.inputs_["INPUT1"].set_data_from_numpy(self.input_data_["INPUT1"]) self.outputs_ = { - "INPUT0": grpcclient.InferRequestedOutput('OUTPUT0'), - "INPUT1": grpcclient.InferRequestedOutput('OUTPUT1') + "INPUT0": grpcclient.InferRequestedOutput("OUTPUT0"), + "INPUT1": grpcclient.InferRequestedOutput("OUTPUT1"), } def add_deferred_exception(self, ex): @@ -93,9 +93,9 @@ def check_response(self, thresholds, provided_inputs=("INPUT0", "INPUT1")): outputs.append(self.outputs_[provided_input]) triton_client = grpcclient.InferenceServerClient("localhost:8001") - results = triton_client.infer(model_name=self.model_name_, - inputs=inputs, - outputs=outputs) + results = triton_client.infer( + model_name=self.model_name_, inputs=inputs, outputs=outputs + ) end_ms = int(round(time.time() * 1000)) @@ -106,66 +106,103 @@ def check_response(self, thresholds, provided_inputs=("INPUT0", "INPUT1")): self.assertTrue( np.array_equal(output_data, expected), "{}, {}, expected: {}, got {}".format( - self.model_name_, output_name, expected, output_data)) + self.model_name_, output_name, expected, output_data + ), + ) gt_ms = thresholds[0] lt_ms = thresholds[1] if lt_ms is not None: self.assertTrue( (end_ms - start_ms) < lt_ms, - "expected less than " + str(lt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) if gt_ms is not None: self.assertTrue( (end_ms - start_ms) > gt_ms, - "expected greater than " + str(gt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) except Exception as ex: self.add_deferred_exception(ex) def check_status(self, model_name, batch_exec, request_cnt, infer_cnt): - stats = self.triton_client_.get_inference_statistics(model_name, "1") - self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") - self.assertEqual(stats.model_stats[0].name, model_name, - "expect model stats for model {}".format(model_name)) + # There is a time window between when responses are returned and statistics are updated. + # To prevent intermittent test failure during that window, wait up to 10 seconds for the + # inference statistics to be ready. 
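# --- Editor's note: a hypothetical, generalized form of the retry loop the patch
# --- adds just below; the helper name and defaults are illustrative, not from the patch.
import time

def wait_for_inference_stats(client, model_name, attempts=10, delay_sec=1):
    """Poll get_inference_statistics until an execution has been recorded."""
    stats = client.get_inference_statistics(model_name, "1")
    for _ in range(attempts):
        if stats.model_stats and stats.model_stats[0].execution_count > 0:
            break
        time.sleep(delay_sec)
        stats = client.get_inference_statistics(model_name, "1")
    return stats  # callers assert on the returned statistics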
+ num_tries = 10 + for i in range(num_tries): + stats = self.triton_client_.get_inference_statistics(model_name, "1") + self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") + actual_exec_cnt = stats.model_stats[0].execution_count + if stats.model_stats[0].execution_count > 0: + break + time.sleep(1) + self.assertEqual( - stats.model_stats[0].version, "1", - "expect model stats for model {} version 1".format(model_name)) + stats.model_stats[0].name, + model_name, + "expect model stats for model {}".format(model_name), + ) + self.assertEqual( + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(model_name), + ) batch_stats = stats.model_stats[0].batch_stats self.assertEqual( - len(batch_stats), len(batch_exec), + len(batch_stats), + len(batch_exec), "expected {} different batch-sizes, got {}".format( - len(batch_exec), len(batch_stats))) + len(batch_exec), len(batch_stats) + ), + ) for batch_stat in batch_stats: bs = batch_stat.batch_size bc = batch_stat.compute_infer.count - self.assertTrue(bs in batch_exec, - "unexpected batch-size {}".format(bs)) + self.assertTrue(bs in batch_exec, "unexpected batch-size {}".format(bs)) # Get count from one of the stats self.assertEqual( - bc, batch_exec[bs], - "expected model-execution-count {} for batch size {}, got {}". - format(batch_exec[bs], bs, bc)) + bc, + batch_exec[bs], + "expected model-execution-count {} for batch size {}, got {}".format( + batch_exec[bs], bs, bc + ), + ) actual_request_cnt = stats.model_stats[0].inference_stats.success.count self.assertEqual( - actual_request_cnt, request_cnt, + actual_request_cnt, + request_cnt, "expected model-request-count {}, got {}".format( - request_cnt, actual_request_cnt)) + request_cnt, actual_request_cnt + ), + ) actual_exec_cnt = stats.model_stats[0].execution_count self.assertEqual( - actual_request_cnt, request_cnt, - "expected model-exec-count {}, got {}".format( - request_cnt, actual_exec_cnt)) + actual_request_cnt, + request_cnt, + "expected model-exec-count {}, got {}".format(request_cnt, actual_exec_cnt), + ) actual_infer_cnt = stats.model_stats[0].inference_count self.assertEqual( - actual_infer_cnt, infer_cnt, + actual_infer_cnt, + infer_cnt, "expected model-inference-count {}, got {}".format( - infer_cnt, actual_infer_cnt)) + infer_cnt, actual_infer_cnt + ), + ) def test_all_inputs(self): # Provide all inputs, send requests that don't form preferred batch @@ -173,11 +210,11 @@ def test_all_inputs(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),))) + threading.Thread(target=self.check_response, args=((4000, None),)) + ) threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),))) + threading.Thread(target=self.check_response, args=((4000, None),)) + ) threads[0].start() threads[1].start() for t in threads: @@ -194,13 +231,19 @@ def test_optional_same_input(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),), - kwargs={'provided_inputs': ("INPUT1",)})) + threading.Thread( + target=self.check_response, + args=((4000, None),), + kwargs={"provided_inputs": ("INPUT1",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),), - kwargs={'provided_inputs': ("INPUT1",)})) + threading.Thread( + target=self.check_response, + args=((4000, None),), + kwargs={"provided_inputs": ("INPUT1",)}, + ) + ) threads[0].start() threads[1].start() for t in threads: @@ -218,22 
+261,34 @@ def test_optional_mix_inputs(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT0",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT0",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT1",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT1",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT0",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT0",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),), - kwargs={'provided_inputs': ("INPUT1",)})) + threading.Thread( + target=self.check_response, + args=((4000, None),), + kwargs={"provided_inputs": ("INPUT1",)}, + ) + ) for t in threads: t.start() time.sleep(0.5) @@ -253,19 +308,26 @@ def test_optional_mix_inputs_2(self): try: threads = [] threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT0",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT0",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, args=((0, 4000),))) + threading.Thread(target=self.check_response, args=((0, 4000),)) + ) threads.append( - threading.Thread(target=self.check_response, - args=((0, 4000),), - kwargs={'provided_inputs': ("INPUT0",)})) + threading.Thread( + target=self.check_response, + args=((0, 4000),), + kwargs={"provided_inputs": ("INPUT0",)}, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=((4000, None),))) + threading.Thread(target=self.check_response, args=((4000, None),)) + ) for t in threads: t.start() time.sleep(0.5) @@ -279,28 +341,28 @@ def test_optional_mix_inputs_2(self): def test_ensemble_all_inputs(self): # The ensemble is only a wrapper over 'identity_2_float32' - self.model_name_ = 'ensemble_identity_2_float32' + self.model_name_ = "ensemble_identity_2_float32" self.test_all_inputs() # From the ensemble's perspective, the requests are processed as it is self.check_status(self.model_name_, {1: 2}, 2, 2) def test_ensemble_optional_same_input(self): # The ensemble is only a wrapper over 'identity_2_float32' - self.model_name_ = 'ensemble_identity_2_float32' + self.model_name_ = "ensemble_identity_2_float32" self.test_optional_same_input() # From the ensemble's perspective, the requests are processed as it is self.check_status(self.model_name_, {1: 2}, 2, 2) def test_ensemble_optional_mix_inputs(self): # The ensemble is only a wrapper over 'identity_2_float32' - self.model_name_ = 'ensemble_identity_2_float32' + self.model_name_ = "ensemble_identity_2_float32" self.test_optional_mix_inputs() # From the ensemble's perspective, the requests are processed as it is self.check_status(self.model_name_, {1: 4}, 4, 4) def test_ensemble_optional_mix_inputs_2(self): # The ensemble is only a wrapper over 'identity_2_float32' - self.model_name_ = 'ensemble_identity_2_float32' + self.model_name_ = "ensemble_identity_2_float32" self.test_optional_mix_inputs_2() # From the ensemble's perspective, the requests are processed as it is self.check_status(self.model_name_, {1: 4}, 4, 4) @@ -310,7 +372,7 @@ 
def test_ensemble_optional_pipeline(self): # inputs, where the ensemble step only connects a subset of inputs # for the second model (which is valid because the disconnected inputs # are marked optional). See 'config.pbtxt' for detail. - self.model_name_ = 'pipeline_identity_2_float32' + self.model_name_ = "pipeline_identity_2_float32" # Provide all inputs, send requests that don't form preferred batch # so all requests should be returned after the queue delay @@ -321,28 +383,63 @@ def test_ensemble_optional_pipeline(self): inputs.append(self.inputs_[provided_input]) triton_client = grpcclient.InferenceServerClient("localhost:8001") - results = triton_client.infer(model_name=self.model_name_, - inputs=inputs) + results = triton_client.infer(model_name=self.model_name_, inputs=inputs) # OUTPU0 is always zero, OUTPUT1 = INPUT0 output_data = results.as_numpy("OUTPUT0") expected = np.zeros(shape=(1, 1), dtype=np.float32) self.assertTrue( np.array_equal(output_data, expected), - "{}, {}, expected: {}, got {}".format(self.model_name_, - "OUTPUT0", expected, - output_data)) + "{}, {}, expected: {}, got {}".format( + self.model_name_, "OUTPUT0", expected, output_data + ), + ) expected = self.input_data_["INPUT0"] output_data = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output_data, expected), - "{}, {}, expected: {}, got {}".format(self.model_name_, - "OUTPUT1", expected, - output_data)) + "{}, {}, expected: {}, got {}".format( + self.model_name_, "OUTPUT1", expected, output_data + ), + ) + except Exception as ex: + self.assertTrue(False, "unexpected error {}".format(ex)) + + def test_ensemble_optional_connecting_tensor(self): + # The ensemble is a special case of pipelining models with optional + # inputs, where the request will only produce a subset of inputs + # for the second model while the ensemble graph connects all inputs of + # the second model (which is valid because the not-provided inputs + # are marked optional). See 'config.pbtxt' for detail. 
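# --- Editor's note: a tiny stand-alone simulation of the data flow described in the
# --- comment above (only INPUT0 provided, two chained optional_identity steps).
# --- Purely illustrative; these functions are not part of the patch or the Triton API.
def optional_identity(tensors):
    # Echo whichever optional inputs are present, INPUTn -> OUTPUTn.
    return {"OUTPUT" + name[-1]: value for name, value in tensors.items()}

step1 = optional_identity({"INPUT0": 1.0})               # -> {"OUTPUT0": 1.0}
step2 = optional_identity({"INPUT0": step1["OUTPUT0"]})  # internal_output0 feeds INPUT0
assert step2 == {"OUTPUT0": 1.0}                         # OUTPUT1 is simply never produced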
+ self.model_name_ = "optional_connecting_tensor" + + # Provide all inputs, send requests that don't form preferred batch + # so all requests should be returned after the queue delay + try: + provided_inputs = ("INPUT0",) + inputs = [] + outputs = [] + for provided_input in provided_inputs: + inputs.append(self.inputs_[provided_input]) + outputs.append(self.outputs_[provided_input]) + + triton_client = grpcclient.InferenceServerClient("localhost:8001") + results = triton_client.infer( + model_name=self.model_name_, inputs=inputs, outputs=outputs + ) + + expected = self.input_data_["INPUT0"] + output_data = results.as_numpy("OUTPUT0") + self.assertTrue( + np.array_equal(output_data, expected), + "{}, {}, expected: {}, got {}".format( + self.model_name_, "OUTPUT0", expected, output_data + ), + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_optional_input/test.sh b/qa/L0_optional_input/test.sh index 351be38d4d..8bfd113d32 100755 --- a/qa/L0_optional_input/test.sh +++ b/qa/L0_optional_input/test.sh @@ -41,6 +41,7 @@ rm -fr *.log mkdir -p ./models/identity_2_float32/1 mkdir -p ./models/ensemble_identity_2_float32/1 mkdir -p ./models/pipeline_identity_2_float32/1 +mkdir -p ./models/optional_connecting_tensor/1 # Basic test cases TEST_CASES=${TEST_CASES:="test_all_inputs \ @@ -51,8 +52,9 @@ TEST_CASES=${TEST_CASES:="test_all_inputs \ test_ensemble_optional_same_input \ test_ensemble_optional_mix_inputs \ test_ensemble_optional_mix_inputs_2 \ - test_ensemble_optional_pipeline"} - + test_ensemble_optional_pipeline \ + test_ensemble_optional_connecting_tensor"} +RET=0 for i in $TEST_CASES ; do # Restart server for every test to clear model stats run_server @@ -62,8 +64,6 @@ for i in $TEST_CASES ; do exit 1 fi - RET=0 - echo "Test: $i" >>$TEST_LOG set +e diff --git a/qa/L0_output_name/output_name_test.py b/qa/L0_output_name/output_name_test.py old mode 100644 new mode 100755 index e5efdaddc6..905174640c --- a/qa/L0_output_name/output_name_test.py +++ b/qa/L0_output_name/output_name_test.py @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,26 +26,20 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") -import argparse -import numpy as np -import os -from builtins import range -from functools import partial -from PIL import Image import unittest + import test_util as tu +from tritongrpcclient import grpc_service_pb2, grpc_service_pb2_grpc import grpc -from tritongrpcclient import grpc_service_pb2 -from tritongrpcclient import grpc_service_pb2_grpc _trials = ("graphdef", "libtorch", "onnx", "plan", "savedmodel") class OutputNameValidationTest(tu.TestResultCollector): - def requestGenerator(self, model_name, output_name): request = grpc_service_pb2.ModelInferRequest() request.model_name = model_name @@ -58,12 +52,11 @@ def requestGenerator(self, model_name, output_name): request.inputs.extend([input]) - output = grpc_service_pb2.ModelInferRequest( - ).InferRequestedOutputTensor() + output = grpc_service_pb2.ModelInferRequest().InferRequestedOutputTensor() output.name = output_name request.outputs.extend([output]) - request.raw_input_contents.extend([bytes(4 * 'a', 'utf-8')]) + request.raw_input_contents.extend([bytes(4 * "a", "utf-8")]) return request @@ -78,14 +71,14 @@ def test_grpc(self): try: response = grpc_stub.ModelInfer(request) self.assertTrue( - False, - "unexpected success for unknown output " + model_name) + False, "unexpected success for unknown output " + model_name + ) except grpc.RpcError as rpc_error: msg = rpc_error.details() self.assertTrue( - msg.startswith( - "unexpected inference output 'DUMMY' for model")) + msg.startswith("unexpected inference output 'DUMMY' for model") + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_output_name/test.sh b/qa/L0_output_name/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_output_validation/lt_op_val_client.py b/qa/L0_output_validation/lt_op_val_client.py old mode 100644 new mode 100755 index 7647497fff..77b5a16e3f --- a/qa/L0_output_validation/lt_op_val_client.py +++ b/qa/L0_output_validation/lt_op_val_client.py @@ -27,43 +27,47 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -import requests import unittest + +import requests import test_util as tu class OutputValidationTest(tu.TestResultCollector): # for datatype mismatch def test_datatype(self): - url = 'http://localhost:8000/v2/models/libtorch_datatype_1_float32/infer' + url = "http://localhost:8000/v2/models/libtorch_datatype_1_float32/infer" body = '{"inputs":[{"name":"INPUT__0","shape":[1,1],"datatype":"FP32","data":[1.0]}],"outputs":[{"name":"OUTPUT__0"}]}' response = requests.post(url, data=body) msg = response.json()["error"] self.assertTrue( msg.startswith( "configuration expects datatype TYPE_INT32 for output 'OUTPUT__0', model provides TYPE_FP32" - )) + ) + ) # for output mismatch def test_index(self): - url = 'http://localhost:8000/v2/models/libtorch_index_1_float32/infer' + url = "http://localhost:8000/v2/models/libtorch_index_1_float32/infer" body = '{"inputs":[{"name":"INPUT__0","shape":[1,1],"datatype":"FP32","data":[1.0]}],"outputs":[{"name":"OUTPUT__1"}]}' response = requests.post(url, data=body) msg = response.json()["error"] self.assertTrue( msg.startswith( "The output OUTPUT__1 in the model configuration refers to an output index which doesn't exist. 
This model has 1 outputs" - )) + ) + ) # successful run def test_success(self): - url = 'http://localhost:8000/v2/models/libtorch_zero_1_float32/infer' + url = "http://localhost:8000/v2/models/libtorch_zero_1_float32/infer" body = '{"inputs":[{"name":"INPUT__0","shape":[1,1],"datatype":"FP32","data":[1.0]}],"outputs":[{"name":"OUTPUT__0"}]}' response = requests.post(url, data=body) self.assertEqual(response.status_code, 200) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_output_validation/test.sh b/qa/L0_output_validation/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_parallel_copy/parallel_copy_test.py b/qa/L0_parallel_copy/parallel_copy_test.py old mode 100644 new mode 100755 index c9b958f5ed..6748fee006 --- a/qa/L0_parallel_copy/parallel_copy_test.py +++ b/qa/L0_parallel_copy/parallel_copy_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,35 +27,39 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -from builtins import range +import functools import time import unittest +from builtins import range + import numpy as np import test_util as tu -import functools import tritonclient.grpc as grpcclient from tritonclient.utils import InferenceServerException class ParallelCopyTest(tu.TestResultCollector): - def setUp(self): self.client_ = grpcclient.InferenceServerClient("localhost:8001") self.dtype_ = np.float32 - self.model_name_ = tu.get_zero_model_name('plan', 1, self.dtype_) + self.model_name_ = tu.get_zero_model_name("plan", 1, self.dtype_) def _batch_input_duration(self, batch_size): stats = self.client_.get_inference_statistics(self.model_name_, "1") self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") self.assertEqual( - stats.model_stats[0].name, self.model_name_, - "expect model stats for model {}".format(self.model_name_)) + stats.model_stats[0].name, + self.model_name_, + "expect model stats for model {}".format(self.model_name_), + ) self.assertEqual( - stats.model_stats[0].version, "1", - "expect model stats for model {} version 1".format( - self.model_name_)) + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(self.model_name_), + ) batch_stats = stats.model_stats[0].batch_stats @@ -69,10 +75,11 @@ def _run(self, batch_sizes): np.random.random([bs, 16 * 1024 * 1024]).astype(self.dtype_) for bs in batch_sizes ] - inputs = [[ - grpcclient.InferInput('INPUT0', [bs, 16 * 1024 * 1024], "FP32") - ] for bs in batch_sizes] - output = [grpcclient.InferRequestedOutput('OUTPUT0')] + inputs = [ + [grpcclient.InferInput("INPUT0", [bs, 16 * 1024 * 1024], "FP32")] + for bs in batch_sizes + ] + output = [grpcclient.InferRequestedOutput("OUTPUT0")] for idx in range(len(inputs)): inputs[idx][0].set_data_from_numpy(input_data[idx]) @@ -88,11 +95,12 @@ def callback(user_data, idx, result, error): before_compute_input_duration = self._batch_input_duration(batch_size) for idx in range(len(batch_sizes)): - self.client_.async_infer(model_name=self.model_name_, - inputs=inputs[idx], - callback=functools.partial( - callback, user_data, idx), - outputs=output) + self.client_.async_infer( + model_name=self.model_name_, + inputs=inputs[idx], + 
callback=functools.partial(callback, user_data, idx), + outputs=output, + ) # Wait until the results are available in user_data time_out = 20 @@ -107,19 +115,24 @@ def callback(user_data, idx, result, error): time_out = time_out - 1 time.sleep(1) done_cnt = functools.reduce( - lambda dc, x: dc + 1 if x is not None else dc, user_data, 0) + lambda dc, x: dc + 1 if x is not None else dc, user_data, 0 + ) self.assertEqual( - done_cnt, len(batch_sizes), - "expected {} responses, got {}".format(len(batch_sizes), done_cnt)) + done_cnt, + len(batch_sizes), + "expected {} responses, got {}".format(len(batch_sizes), done_cnt), + ) for idx in range(len(batch_sizes)): res = user_data[idx] self.assertFalse( type(res) == InferenceServerException, - "expected response for request {}, got exception {}".format( - idx, res)) - output_data = res.as_numpy('OUTPUT0') - self.assertTrue(np.array_equal(output_data, input_data[idx]), - "Mismatched output data for request {}".format(idx)) + "expected response for request {}, got exception {}".format(idx, res), + ) + output_data = res.as_numpy("OUTPUT0") + self.assertTrue( + np.array_equal(output_data, input_data[idx]), + "Mismatched output data for request {}".format(idx), + ) after_compute_input_duration = self._batch_input_duration(batch_size) return after_compute_input_duration - before_compute_input_duration @@ -134,13 +147,17 @@ def test_performance(self): # The following check is loose, local runs show that the speedup is not # significant (~15%), may be due to the dispatch overhead - # which cancels part of the improvment + # which cancels part of the improvement self.assertTrue( serialized_time > parallelized_time, - "Expected parallelized copy is faster than serialized copy") - print("serialized v.s. parallelized : {} v.s. {}".format( - serialized_time, parallelized_time)) + "Expected parallelized copy is faster than serialized copy", + ) + print( + "serialized v.s. parallelized : {} v.s. {}".format( + serialized_time, parallelized_time + ) + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_parameters/model_repository/ensemble/config.pbtxt b/qa/L0_parameters/model_repository/ensemble/config.pbtxt new file mode 100644 index 0000000000..383d89c9f6 --- /dev/null +++ b/qa/L0_parameters/model_repository/ensemble/config.pbtxt @@ -0,0 +1,68 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +platform: "ensemble" +max_batch_size: 0 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "key" + data_type: TYPE_STRING + dims: [ -1 ] + }, + { + name: "value" + data_type: TYPE_STRING + dims: [ -1 ] + } +] + +ensemble_scheduling +{ + step [ + { + model_name: "identity" + model_version: -1 + input_map { key: "INPUT0", value: "INPUT0" } + output_map { key: "OUTPUT0", value: "OUTPUT0" } + }, + { + model_name: "parameter" + model_version: -1 + input_map { key: "INPUT0", value: "OUTPUT0" } + output_map { key: "key", value: "key" } + output_map { key: "value", value: "value" } + } + ] +} diff --git a/qa/L0_parameters/model_repository/identity/config.pbtxt b/qa/L0_parameters/model_repository/identity/config.pbtxt new file mode 100644 index 0000000000..8908845574 --- /dev/null +++ b/qa/L0_parameters/model_repository/identity/config.pbtxt @@ -0,0 +1,44 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] diff --git a/qa/L0_parameters/model_repository/parameter/1/model.py b/qa/L0_parameters/model_repository/parameter/1/model.py new file mode 100644 index 0000000000..c175860962 --- /dev/null +++ b/qa/L0_parameters/model_repository/parameter/1/model.py @@ -0,0 +1,77 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + inputs = [{"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [1]}] + outputs = [ + {"name": "key", "data_type": "TYPE_STRING", "dims": [-1]}, + {"name": "value", "data_type": "TYPE_STRING", "dims": [-1]}, + ] + + config = auto_complete_model_config.as_dict() + input_names = [] + output_names = [] + for input in config["input"]: + input_names.append(input["name"]) + for output in config["output"]: + output_names.append(output["name"]) + + for input in inputs: + if input["name"] not in input_names: + auto_complete_model_config.add_input(input) + for output in outputs: + if output["name"] not in output_names: + auto_complete_model_config.add_output(output) + + auto_complete_model_config.set_max_batch_size(0) + return auto_complete_model_config + + def execute(self, requests): + # A simple model that puts the request parameters into the outputs. + responses = [] + for request in requests: + parameters = json.loads(request.parameters()) + keys = [] + values = [] + for key, value in parameters.items(): + keys.append(key) + values.append(value) + key_output = pb_utils.Tensor("key", np.asarray(keys, dtype=object)) + value_output = pb_utils.Tensor("value", np.asarray(values, dtype=object)) + inference_response = pb_utils.InferenceResponse( + output_tensors=[key_output, value_output] + ) + responses.append(inference_response) + + return responses diff --git a/qa/L0_parameters/parameters_test.py b/qa/L0_parameters/parameters_test.py new file mode 100755 index 0000000000..959f0fc5dc --- /dev/null +++ b/qa/L0_parameters/parameters_test.py @@ -0,0 +1,223 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import os +import queue +import unittest +from functools import partial +from unittest import IsolatedAsyncioTestCase + +import numpy as np +import tritonclient.grpc as grpcclient +import tritonclient.grpc.aio as asyncgrpcclient +import tritonclient.http as httpclient +import tritonclient.http.aio as asynchttpclient +from tritonclient.utils import InferenceServerException + +TEST_HEADER = os.environ.get("TEST_HEADER") + + +class InferenceParametersTest(IsolatedAsyncioTestCase): + async def asyncSetUp(self): + self.http = httpclient.InferenceServerClient(url="localhost:8000") + self.async_http = asynchttpclient.InferenceServerClient(url="localhost:8000") + self.grpc = grpcclient.InferenceServerClient(url="localhost:8001") + self.async_grpc = asyncgrpcclient.InferenceServerClient(url="localhost:8001") + + self.parameter_list = [] + self.parameter_list.append({"key1": "value1", "key2": "value2"}) + self.parameter_list.append({"key1": 1, "key2": 2}) + self.parameter_list.append({"key1": True, "key2": "value2"}) + self.parameter_list.append({"triton_": True, "key2": "value2"}) + + if TEST_HEADER == "1": + self.headers = { + "header_1": "value_1", + "header_2": "value_2", + "my_header_1": "my_value_1", + "my_header_2": "my_value_2", + "my_header_3": 'This is a "quoted" string with a backslash\ ', + } + + # only these headers should be forwarded to the model. 
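# --- Editor's note: a minimal sketch of how the expected subset below relates to the
# --- full header dict above; the pattern mirrors the my_header.* value passed to
# --- --http/--grpc-header-forward-pattern later in test.sh (illustrative only).
import re

forward_pattern = re.compile(r"my_header.*")
all_headers = {"header_1": "value_1", "my_header_1": "my_value_1"}
forwarded = {k: v for k, v in all_headers.items() if forward_pattern.match(k)}
assert forwarded == {"my_header_1": "my_value_1"}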
+ self.expected_headers = { + "my_header_1": "my_value_1", + "my_header_2": "my_value_2", + "my_header_3": 'This is a "quoted" string with a backslash\ ', + } + else: + self.headers = {} + self.expected_headers = {} + + def callback(user_data, result, error): + if error: + user_data.put(error) + else: + user_data.put(result) + + self.grpc_callback = callback + + def create_inputs(self, client_type): + inputs = [] + inputs.append(client_type.InferInput("INPUT0", [1], "FP32")) + + # Initialize the data + inputs[0].set_data_from_numpy(np.asarray([1], dtype=np.float32)) + return inputs + + async def send_request_and_verify( + self, client_type, client, is_async=False, model_name="parameter" + ): + inputs = self.create_inputs(client_type) + for parameters in self.parameter_list: + # Setup infer callable to re-use below for brevity + infer_callable = partial( + client.infer, + model_name=model_name, + inputs=inputs, + parameters=parameters, + headers=self.headers, + ) + + # The `triton_` prefix is reserved for Triton usage + should_error = False + if "triton_" in parameters.keys(): + should_error = True + + if is_async: + if should_error: + with self.assertRaises(InferenceServerException): + await infer_callable() + return + else: + result = await infer_callable() + else: + if should_error: + with self.assertRaises(InferenceServerException): + infer_callable() + return + else: + result = infer_callable() + + self.verify_outputs(result, parameters) + + def verify_outputs(self, result, parameters): + keys = result.as_numpy("key") + values = result.as_numpy("value") + keys = keys.astype(str).tolist() + expected_keys = list(parameters.keys()) + list(self.expected_headers.keys()) + self.assertEqual(set(keys), set(expected_keys)) + + # We have to convert the parameter values to string + expected_values = [] + for expected_value in list(parameters.values()): + expected_values.append(str(expected_value)) + for value in self.expected_headers.values(): + expected_values.append(value) + self.assertEqual(set(values.astype(str).tolist()), set(expected_values)) + + async def test_grpc_parameter(self): + await self.send_request_and_verify(grpcclient, self.grpc) + + async def test_http_parameter(self): + await self.send_request_and_verify(httpclient, self.http) + + async def test_async_http_parameter(self): + await self.send_request_and_verify( + asynchttpclient, self.async_http, is_async=True + ) + + async def test_async_grpc_parameter(self): + await self.send_request_and_verify( + asyncgrpcclient, self.async_grpc, is_async=True + ) + + def test_http_async_parameter(self): + inputs = self.create_inputs(httpclient) + # Skip the parameter that returns an error + parameter_list = self.parameter_list[:-1] + for parameters in parameter_list: + result = self.http.async_infer( + model_name="parameter", + inputs=inputs, + parameters=parameters, + headers=self.headers, + ).get_result() + self.verify_outputs(result, parameters) + + def test_grpc_async_parameter(self): + user_data = queue.Queue() + inputs = self.create_inputs(grpcclient) + # Skip the parameter that returns an error + parameter_list = self.parameter_list[:-1] + for parameters in parameter_list: + self.grpc.async_infer( + model_name="parameter", + inputs=inputs, + parameters=parameters, + headers=self.headers, + callback=partial(self.grpc_callback, user_data), + ) + result = user_data.get() + self.assertFalse(result is InferenceServerException) + self.verify_outputs(result, parameters) + + def test_grpc_stream_parameter(self): + user_data = queue.Queue() + 
self.grpc.start_stream( + callback=partial(self.grpc_callback, user_data), headers=self.headers + ) + inputs = self.create_inputs(grpcclient) + # Skip the parameter that returns an error + parameter_list = self.parameter_list[:-1] + for parameters in parameter_list: + # async stream infer + self.grpc.async_stream_infer( + model_name="parameter", inputs=inputs, parameters=parameters + ) + result = user_data.get() + self.assertFalse(result is InferenceServerException) + self.verify_outputs(result, parameters) + self.grpc.stop_stream() + + async def test_ensemble_parameter_forwarding(self): + await self.send_request_and_verify(httpclient, self.http, model_name="ensemble") + + async def asyncTearDown(self): + self.http.close() + self.grpc.close() + await self.async_grpc.close() + await self.async_http.close() + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_parameters/test.sh b/qa/L0_parameters/test.sh new file mode 100755 index 0000000000..967ead15c7 --- /dev/null +++ b/qa/L0_parameters/test.sh @@ -0,0 +1,95 @@ +#!/bin/bash +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! 
-z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +CLIENT_LOG="./client.log" +TEST_SCRIPT_PY="parameters_test.py" + +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_LOG="./inference_server.log" +source ../common/util.sh + +MODELDIR="model_repository" +# Use identity model as dummy step to ensure parameters pass through each step +mkdir -p "${MODELDIR}/identity/1" +mkdir -p "${MODELDIR}/ensemble/1" + +# TODO: Add support and testing for C++ client parameters: +# https://jirasw.nvidia.com/browse/DLIS-4673 + +RET=0 +for i in {0..1}; do + + # TEST_HEADER is a parameter used by `parameters_test.py` that controls + # whether the script will test for inclusion of headers in parameters or not. + if [ $i == 1 ]; then + SERVER_ARGS="--model-repository=${MODELDIR} --exit-timeout-secs=120 --grpc-header-forward-pattern my_header.* --http-header-forward-pattern my_header.*" + else + SERVER_ARGS="--model-repository=${MODELDIR} --exit-timeout-secs=120" + fi + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + TEST_HEADER=$i python3 $TEST_SCRIPT_PY >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi + + set -e + + kill $SERVER_PID + wait $SERVER_PID +done + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $CLIENT_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET + diff --git a/qa/L0_passive_instance/models/distributed_int32_int32_int32/config.pbtxt b/qa/L0_passive_instance/models/distributed_int32_int32_int32/config.pbtxt old mode 100755 new mode 100644 diff --git a/qa/L0_passive_instance/passive_instance_test.py b/qa/L0_passive_instance/passive_instance_test.py old mode 100644 new mode 100755 index 38a2724f6e..d7cdfffa7b --- a/qa/L0_passive_instance/passive_instance_test.py +++ b/qa/L0_passive_instance/passive_instance_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,24 +27,25 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu -import tritonclient.http as httpclient class PassiveInstanceTest(tu.TestResultCollector): - def test_inference(self): try: - iu.infer_exact(self, "distributed", (1, 16), 1, np.int32, np.int32, - np.int32) + iu.infer_exact( + self, "distributed", (1, 16), 1, np.int32, np.int32, np.int32 + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_passive_instance/test.sh b/qa/L0_passive_instance/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_perf_analyzer/perf_analyzer_profile_export_schema.json b/qa/L0_perf_analyzer/perf_analyzer_profile_export_schema.json new file mode 100644 index 0000000000..d0feacd9b4 --- /dev/null +++ b/qa/L0_perf_analyzer/perf_analyzer_profile_export_schema.json @@ -0,0 +1,95 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/examples/schema.json", + "title": "Perf Analyzer output data", + "description": "A json file describing the output from a Perf Analyzer run.", + "type": "object", + "required": [ + "experiments", + "version" + ], + "properties": { + "experiments": { + "description": "The array of all experiments run by Perf Analyzer.", + "type": "array", + "required": [ + "experiment", + "requests", + "window_boundaries" + ], + "minItems": 1, + "uniqueItems": true, + "items": { + "type": "object", + "properties": { + "experiment": { + "description": "A single experiment run by Perf Analyzer.", + "type": "object", + "required": [ + "mode", + "value" + ], + "minItems": 1, + "maxItems": 1, + "properties": { + "mode": { + "description": "Operating mode of Perf Analyzer: For example, 'concurrency' or 'request rate'.", + "type": "string" + }, + "value": { + "description": "Concurrency or request rate for the current experiment.", + "type": "integer" + } + } + }, + "requests": { + "description": "The array of requests sent by Perf Analyzer for this experiment.", + "type": "array", + "items": { + "$ref": "#/properties/experiments/items/properties/$defs/request" + } + }, + "$defs": { + "request": { + "description": "Info for a single request.", + "type": "object", + "required": [ + "timestamp", + "response_timestamps" + ], + "properties": { + "timestamp": { + "description": "Time stamp of the request.", + "type": "integer" + }, + "sequence_id": { + "description": "The sequence_id of the request.", + "type": "integer" + }, + "response_timestamps": { + "description": "All associated responses to this request.", + "type": "array", + "items": { + "type": "integer" + } + } + } + } + }, + "window_boundaries": { + "description": "An array of time stamps describing window boundaries.", + "type": "array", + "items": { + "type": "integer" + }, + "uniqueItems": true + } + } + } + }, + "version": { + "description": "The version of Perf Analyzer that generated the report.", + "type": "string" + } + } +} \ No newline at end of file diff --git a/qa/L0_perf_analyzer/test.sh b/qa/L0_perf_analyzer/test.sh index 4c2b7244a2..20a659da85 100755 --- a/qa/L0_perf_analyzer/test.sh +++ b/qa/L0_perf_analyzer/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -48,6 +48,7 @@ TESTDATADIR=`pwd`/test_data INT_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/int_data.json INT_DIFFSHAPE_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/int_data_diff_shape.json +INT_OPTIONAL_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/int_data_optional.json FLOAT_DIFFSHAPE_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/float_data_with_shape.json STRING_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/string_data.json STRING_WITHSHAPE_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/string_data_with_shape.json @@ -63,12 +64,16 @@ WRONG_OUTPUT_2_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/wrong_ SEQ_OUTPUT_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/seq_output.json SEQ_WRONG_OUTPUT_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/seq_wrong_output.json +REPEAT_INT32_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/repeat_int32_data.json + SERVER=/opt/tritonserver/bin/tritonserver SERVER_ARGS="--model-repository=${DATADIR}" SERVER_LOG="./inference_server.log" ERROR_STRING="error | Request count: 0 | : 0 infer/sec" +STABILITY_THRESHOLD="100" + source ../common/util.sh rm -f $SERVER_LOG $CLIENT_LOG @@ -112,6 +117,18 @@ cp -r ../custom_models/custom_zero_1_float32 $DATADIR && \ echo "{ key: \"execute_delay_ms\"; value: { string_value: \"100\" }}" >> config.pbtxt && \ echo "]" >> config.pbtxt) +# Copy and customize optional inputs model +cp -r ../python_models/optional $DATADIR && \ + mkdir $DATADIR/optional/1 && \ + mv $DATADIR/optional/model.py $DATADIR/optional/1 && \ + sed -i 's/max_batch_size: 0/max_batch_size: 2/g' $DATADIR/optional/config.pbtxt + +# Copy decoupled model +git clone --depth=1 https://github.com/triton-inference-server/python_backend +mkdir -p $DATADIR/repeat_int32/1 +cp python_backend/examples/decoupled/repeat_config.pbtxt $DATADIR/repeat_int32/config.pbtxt +cp python_backend/examples/decoupled/repeat_model.py $DATADIR/repeat_int32/1/model.py + # Generating test data mkdir -p $TESTDATADIR for INPUT in INPUT0 INPUT1; do @@ -136,7 +153,7 @@ fi SERVER_ERROR_STRING="The previous sequence did not end before this sequence start" set +e -$PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object -p2000 >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object -p2000 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed: Expected an error when using dynamic shapes in string inputs\n***" @@ -148,21 +165,9 @@ if [ $(cat $CLIENT_LOG | grep "input INPUT0 contains dynamic shape, provide sha RET=1 fi -$PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object -p2000 --shape INPUT0 >$CLIENT_LOG 2>&1 -if [ $? -eq 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Failed: Expected an error when using dynamic shapes with incorrect arguments\n***" - RET=1 -fi -if [ $(cat $CLIENT_LOG | grep "failed to parse input shape. There must be a colon after input name." 
| wc -l) -eq 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Failed: \n***" - RET=1 -fi - # Testing with ensemble and sequential model variants $PERF_ANALYZER -v -i grpc -m simple_savedmodel_sequence_object -p 2000 -t5 --streaming \ ---input-data=$SEQ_JSONDATAFILE --input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 +--input-data=$SEQ_JSONDATAFILE --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -180,7 +185,7 @@ if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then fi $PERF_ANALYZER -v -i grpc -m simple_savedmodel_sequence_object -p 1000 --request-rate-range 100:200:50 --streaming \ ---input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 +--input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -205,7 +210,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 \ - --shared-memory=$SHARED_MEMORY_TYPE >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -218,7 +223,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 -a \ - --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -238,7 +243,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m inception_v1_graphdef -t 1 -p2000 -b 1 \ - --shared-memory=$SHARED_MEMORY_TYPE >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -251,7 +256,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m inception_v1_graphdef -t 1 -p2000 -b 1 -a \ - --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -269,7 +274,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m inception_v1_graphdef -t 2 -p2000 -b 64 \ - --shared-memory=$SHARED_MEMORY_TYPE >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -282,7 +287,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m inception_v1_graphdef -t 2 -p2000 -b 64 \ - --shared-memory=$SHARED_MEMORY_TYPE -a >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -a -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -300,7 +305,7 @@ for PROTOCOL in grpc http; do for MODEL in graphdef_nobatch_int32_int32_int32 graphdef_int32_int32_int32; do # Valid batch size set +e - $PERF_ANALYZER -v -i $PROTOCOL -m $MODEL -t 1 -p2000 -b 1 >$CLIENT_LOG 2>&1 + $PERF_ANALYZER -v -i $PROTOCOL -m $MODEL -t 1 -p2000 -b 1 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -311,7 +316,7 @@ for PROTOCOL in grpc http; do # Invalid batch sizes for STATIC_BATCH in 0 10; do set +e - $PERF_ANALYZER -v -i $PROTOCOL -m $MODEL -t 1 -p2000 -b $STATIC_BATCH >$CLIENT_LOG 2>&1 + $PERF_ANALYZER -v -i $PROTOCOL -m $MODEL -t 1 -p2000 -b $STATIC_BATCH -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -323,7 +328,7 @@ for PROTOCOL in grpc http; do # Testing with the new arguments set +e - $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 >$CLIENT_LOG 2>&1 + $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -335,7 +340,7 @@ for PROTOCOL in grpc http; do RET=1 fi - $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --concurrency-range 1:5:2 >$CLIENT_LOG 2>&1 + $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --concurrency-range 1:5:2 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -348,7 +353,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --concurrency-range 1:5:2 \ - --input-data=${INT_JSONDATAFILE} >$CLIENT_LOG 2>&1 + --input-data=${INT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -361,7 +366,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --request-rate-range 1000:2000:500 \ - -p1000 -b 1 -a>$CLIENT_LOG 2>&1 + -p1000 -b 1 -a -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -374,7 +379,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --request-rate-range 1000:2000:500 \ - --input-data=${INT_JSONDATAFILE} -p1000 -b 1 -a>$CLIENT_LOG 2>&1 + --input-data=${INT_JSONDATAFILE} -p1000 -b 1 -a -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -388,7 +393,7 @@ for PROTOCOL in grpc http; do # Binary search for request rate mode $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_int32 --request-rate-range 1000:2000:100 -p1000 -b 1 \ - -a --binary-search --request-distribution "poisson" -l 10 >$CLIENT_LOG 2>&1 + -a --binary-search --request-distribution "poisson" -l 10 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -400,11 +405,11 @@ for PROTOCOL in grpc http; do RET=1 fi set -e - + # Binary search for concurrency range mode and make sure it doesn't hang $PERF_ANALYZER -v -a --request-distribution "poisson" --shared-memory none \ --percentile 99 --binary-search --concurrency-range 1:8:2 -l 5 \ - -m graphdef_int32_int32_int32 -b 1 >$CLIENT_LOG 2>&1 & + -m graphdef_int32_int32_int32 -b 1 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 & PA_PID=$! 
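 # $! holds the PID of the backgrounded perf_analyzer run; the checks that
 # follow use it to fail fast if the process never started and, per the
 # comment above, to make sure the binary-search run does not hang.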
if [ "$PA_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $PERF_ANALYZER\n***" @@ -435,7 +440,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object --string-data=1 -p2000 \ - --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -453,7 +458,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object --input-data=$TESTDATADIR -p2000 \ - --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -470,7 +475,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_object_object --input-data=$STRING_JSONDATAFILE \ - --input-data=$STRING_JSONDATAFILE -p2000 --shared-memory=$SHARED_MEMORY_TYPE>$CLIENT_LOG 2>&1 + --input-data=$STRING_JSONDATAFILE -p2000 --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -488,7 +493,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_int32_int32 --input-data=$TESTDATADIR \ - --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 --shared-memory=$SHARED_MEMORY_TYPE \ + --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} \ >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG @@ -506,7 +511,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_object_int32_int32 --input-data=$STRING_WITHSHAPE_JSONDATAFILE \ - --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 --shared-memory=$SHARED_MEMORY_TYPE \ + --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} \ >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG @@ -523,7 +528,7 @@ for PROTOCOL in grpc http; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_float32 --shape INPUT0:2,8,2 \ - --shape INPUT1:2,8,2 -p2000 >$CLIENT_LOG 2>&1 + --shape INPUT1:2,8,2 -p2000 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -540,13 +545,13 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system cuda; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m graphdef_int32_int32_float32 --shape INPUT0:2,8,2 --shape INPUT1:2,8,2 -p2000 -b 4 \ - --shared-memory=$SHARED_MEMORY_TYPE --input-data=$INT_DIFFSHAPE_JSONDATAFILE >$CLIENT_LOG 2>&1 + --shared-memory=$SHARED_MEMORY_TYPE --input-data=$INT_DIFFSHAPE_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? 
-eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi - if [ $(cat $CLIENT_LOG | grep "can not batch tensors with different shapes together" | wc -l) -eq 0 ]; then + if [ $(cat $CLIENT_LOG | grep -P "The supplied shape .+ is incompatible with the model's input shape" | wc -l) -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 @@ -558,7 +563,7 @@ for PROTOCOL in grpc http; do for SHARED_MEMORY_TYPE in none system; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m plan_zero_1_float32 --input-data=$SHAPETENSORADTAFILE \ - --shape DUMMY_INPUT0:4,4 -p2000 --shared-memory=$SHARED_MEMORY_TYPE -b 8 \ + --shape DUMMY_INPUT0:4,4 -p2000 --shared-memory=$SHARED_MEMORY_TYPE -b 8 -s ${STABILITY_THRESHOLD} \ >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG @@ -575,7 +580,7 @@ for PROTOCOL in grpc http; do set +e $PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 2000 -t5 --sync \ - --input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -588,7 +593,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 2000 -t5 --sync \ - --input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -601,7 +606,7 @@ for PROTOCOL in grpc http; do fi $PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 1000 --request-rate-range 100:200:50 --sync \ - --input-data=$SEQ_JSONDATAFILE >$CLIENT_LOG 2>&1 + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -621,13 +626,13 @@ for PROTOCOL in grpc http; do set +e # FIXME: Enable HTTP when the server is able to correctly return the complex error messages. $PERF_ANALYZER -v -i grpc -m graphdef_sequence_float32 --shape INPUT:2 --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE \ - --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE -p2000 --shared-memory=$SHARED_MEMORY_TYPE >$CLIENT_LOG 2>&1 + --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE -p2000 --shared-memory=$SHARED_MEMORY_TYPE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi - if [ $(cat $CLIENT_LOG | grep "Inputs to operation Select of type Select must have the same size and shape." | wc -l) -eq 0 ]; then + if [ $(cat $CLIENT_LOG | grep -P "The supplied shape .+ is incompatible with the model's input shape" | wc -l) -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 @@ -635,11 +640,32 @@ for PROTOCOL in grpc http; do set -e done + # Testing that trace logging works + set +e + TRACE_FILE="trace.json" + rm ${TRACE_FILE}* + $PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 2000 -t5 --sync --trace-file $TRACE_FILE \ + --trace-level TIMESTAMPS --trace-rate 1000 --trace-count 100 --log-frequency 10 \ + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi + if ! compgen -G "$TRACE_FILE*" > /dev/null; then + echo -e "\n***\n*** Test Failed. 
$TRACE_FILE failed to generate.\n***" + RET=1 + elif [ $(cat ${TRACE_FILE}* | grep "REQUEST_START" | wc -l) -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed. Did not find `REQUEST_START` in $TRACE_FILE \n***" + RET=1 + fi + set -e done # Test with output validation set +e -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${NON_ALIGNED_OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${NON_ALIGNED_OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -651,19 +677,19 @@ if [ $(cat $CLIENT_LOG | grep "The 'validation_data' field doesn't align with ' RET=1 fi -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${WRONG_OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${WRONG_OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi -if [ $(cat $CLIENT_LOG | grep "Output size doesn't match expected size" | wc -l) -eq 0 ]; then +if [ $(cat $CLIENT_LOG | grep "mismatch in the data provided" | wc -l) -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${WRONG_OUTPUT_2_JSONDATAFILE} >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${WRONG_OUTPUT_2_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -676,7 +702,7 @@ if [ $(cat $CLIENT_LOG | grep "Output doesn't match expected output" | wc -l) - fi -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 --input-data=${OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -689,7 +715,7 @@ if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then fi $PERF_ANALYZER -v -m simple_savedmodel_sequence_object -i grpc --streaming \ ---input-data=${SEQ_WRONG_OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +--input-data=${SEQ_WRONG_OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -702,7 +728,7 @@ if [ $(cat $CLIENT_LOG | grep "Output doesn't match expected output" | wc -l) - fi $PERF_ANALYZER -v -m simple_savedmodel_sequence_object -i grpc --streaming \ ---input-data=${SEQ_OUTPUT_JSONDATAFILE} >$CLIENT_LOG 2>&1 +--input-data=${SEQ_OUTPUT_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -722,7 +748,7 @@ for i in {1..9}; do done set +e $PERF_ANALYZER -v -m simple_savedmodel_sequence_object -p 10000 --concurrency-range 1500:2000:250 -i grpc --streaming \ -${INPUT_DATA_OPTION} >$CLIENT_LOG 2>&1 +${INPUT_DATA_OPTION} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -740,7 +766,7 @@ set +e # Send incorrect shape and make sure that perf_analyzer doesn't hang $PERF_ANALYZER -v -m graphdef_object_int32_int32 --measurement-mode "count_windows" \ - --shape INPUT0:1,8,100 --shape INPUT1:2,8 --string-data=1 >$CLIENT_LOG 2>&1 + --shape INPUT0:1,8,100 --shape INPUT1:2,8 --string-data=1 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -eq 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -753,7 +779,7 @@ if [ $(cat $CLIENT_LOG | grep "unexpected shape for input 'INPUT0' for model" | fi $PERF_ANALYZER -v -m graphdef_object_int32_int32 --measurement-mode "count_windows" \ - --shape INPUT0:2,8 --shape INPUT1:2,8 --string-data=1 >$CLIENT_LOG 2>&1 + --shape INPUT0:2,8 --shape INPUT1:2,8 --string-data=1 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -766,6 +792,117 @@ if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then fi set -e +# Test with optional inputs missing but still valid +set +e +$PERF_ANALYZER -v -m optional --measurement-mode "count_windows" \ + --input-data=${INT_OPTIONAL_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +# Test with optional inputs missing and invalid +set +e +OPTIONAL_INPUT_ERROR_STRING="For batch sizes larger than 1, the same set of +inputs must be specified for each batch. You cannot use different set of +optional inputs for each individual batch." +$PERF_ANALYZER -v -m optional -b 2 --measurement-mode "count_windows" \ + --input-data=${INT_OPTIONAL_JSONDATAFILE} -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${OPTIONAL_INPUT_ERROR_STRING}" | wc -l) -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + + +# Test Custom request rate option +CUSTOM_SCHEDULE_FILE=$TESTDATADIR/custom.schedule +echo '30000' >> $CUSTOM_SCHEDULE_FILE +echo '10000' >> $CUSTOM_SCHEDULE_FILE +echo '40000' >> $CUSTOM_SCHEDULE_FILE +echo '20000' >> $CUSTOM_SCHEDULE_FILE +echo '25000' >> $CUSTOM_SCHEDULE_FILE + +set +e +$PERF_ANALYZER -v -i grpc -m graphdef_int32_int32_int32 --request-intervals $CUSTOM_SCHEDULE_FILE >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "Request Rate: 40" | wc -l) -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed: \n***" + RET=1 +fi +set -e + +# Test --serial-sequences mode +set +e +$PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 1000 --request-rate-range 100:200:50 --serial-sequences \ + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi + +$PERF_ANALYZER -v -i $PROTOCOL -m simple_savedmodel_sequence_object -p 1000 --request-intervals $CUSTOM_SCHEDULE_FILE --serial-sequences \ + --input-data=$SEQ_JSONDATAFILE -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? 
-ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +## Test decoupled model support +$PERF_ANALYZER -v -m repeat_int32 --input-data=$REPEAT_INT32_JSONDATAFILE \ + --profile-export-file profile_export.json -i grpc --async --streaming -s \ + ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +python3 -c "import json ; \ + requests = json.load(open('profile_export.json'))['experiments'][0]['requests'] ; \ + assert any(len(r['response_timestamps']) > 1 for r in requests)" +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +check-jsonschema --schemafile perf_analyzer_profile_export_schema.json profile_export.json +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi + ## Test perf_analyzer with MPI / multiple models is_synchronized() { @@ -851,10 +988,10 @@ set -e ## Test perf_analyzer without MPI library (`libmpi.so`) available -rm -rf /opt/hpcx +rm -rf /opt/hpcx/ompi/lib/libmpi* set +e -$PERF_ANALYZER -v -m graphdef_int32_int32_int32 +$PERF_ANALYZER -v -m graphdef_int32_int32_int32 -s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -907,6 +1044,7 @@ $PERF_ANALYZER -v -i grpc -m graphdef_int32_int32_int32 \ --ssl-grpc-root-certifications-file=ca.crt \ --ssl-grpc-private-key-file=client.key \ --ssl-grpc-certificate-chain-file=client.crt \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.grpc_success 2>&1 if [ $? -ne 0 ]; then cat ${CLIENT_LOG}.grpc_success @@ -919,6 +1057,7 @@ $PERF_ANALYZER -v -i grpc -m graphdef_int32_int32_int32 \ --ssl-grpc-root-certifications-file=ca.crt \ --ssl-grpc-private-key-file=client.key \ --ssl-grpc-certificate-chain-file=client2.crt \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.grpc_failure 2>&1 if [ $? -eq 0 ]; then cat ${CLIENT_LOG}.grpc_failure @@ -962,6 +1101,7 @@ $PERF_ANALYZER -v -u https://localhost:443 -i http -m graphdef_int32_int32_int32 --ssl-https-client-certificate-type PEM \ --ssl-https-private-key-file client.key \ --ssl-https-private-key-type PEM \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.https_success 2>&1 if [ $? -ne 0 ]; then cat ${CLIENT_LOG}.https_success @@ -971,7 +1111,8 @@ fi # Test that HTTP protocol with SSL works correctly without certificates $PERF_ANALYZER -v -u https://localhost:443 -i http -m graphdef_int32_int32_int32 \ --ssl-https-verify-peer 0 \ - --ssl-https-verify-host 0 + --ssl-https-verify-host 0 \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.https_success 2>&1 if [ $? -ne 0 ]; then cat ${CLIENT_LOG}.https_success @@ -987,6 +1128,7 @@ $PERF_ANALYZER -v -u https://localhost:443 -i http -m graphdef_int32_int32_int32 --ssl-https-client-certificate-type PEM \ --ssl-https-private-key-file client2.key \ --ssl-https-private-key-type PEM \ + -s ${STABILITY_THRESHOLD} \ > ${CLIENT_LOG}.https_failure 2>&1 if [ $? -eq 0 ]; then cat ${CLIENT_LOG}.https_failure diff --git a/qa/L0_perf_analyzer_capi/test.sh b/qa/L0_perf_analyzer_capi/test.sh index f447fe5d3d..f9fa3c078e 100755 --- a/qa/L0_perf_analyzer_capi/test.sh +++ b/qa/L0_perf_analyzer_capi/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -55,7 +55,8 @@ SEQ_JSONDATAFILE=`pwd`/../common/perf_analyzer_input_data_json/seq_data.json SHAPETENSORADTAFILE=`pwd`/../common/perf_analyzer_input_data_json/shape_tensor_data.json ERROR_STRING="error | Request count: 0 | : 0 infer/sec" -NON_SUPPORTED_ERROR_STRING="supported by C API" + +STABILITY_THRESHOLD="15" source ../common/util.sh @@ -81,6 +82,11 @@ cp -r /data/inferenceserver/${REPO_VERSION}/qa_ensemble_model_repository/qa_sequ # Copying variable sequence model cp -r /data/inferenceserver/${REPO_VERSION}/qa_variable_sequence_model_repository/graphdef_sequence_float32 $DATADIR +# Copying bls model with undefined variable +mkdir -p $DATADIR/bls_undefined/1 && \ + cp ../python_models/bls_undefined/model.py $DATADIR/bls_undefined/1/. && \ + cp ../python_models/bls_undefined/config.pbtxt $DATADIR/bls_undefined/. + # Generating test data mkdir -p $TESTDATADIR for INPUT in INPUT0 INPUT1; do @@ -106,7 +112,7 @@ set -e $PERF_ANALYZER -v -m graphdef_int32_int32_int32 \ --service-kind=triton_c_api \ --model-repository=$DATADIR --triton-server-directory=$SERVER_LIBRARY_PATH \ ->$CLIENT_LOG 2>&1 +-s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -120,7 +126,8 @@ fi $PERF_ANALYZER -v -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -135,7 +142,8 @@ fi #Testing with string input $PERF_ANALYZER -v -m graphdef_object_object_object --string-data=1 -p2000 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -151,7 +159,8 @@ fi $PERF_ANALYZER -v -m graphdef_object_int32_int32 --input-data=$TESTDATADIR \ --shape INPUT0:2,8 --shape INPUT1:2,8 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -162,7 +171,7 @@ $PERF_ANALYZER -v -m graphdef_object_int32_int32 \ --input-data=$STRING_WITHSHAPE_JSONDATAFILE \ --shape INPUT0:2,8 --shape INPUT1:2,8 -p2000 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ >$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG @@ -178,7 +187,8 @@ fi $PERF_ANALYZER -v -m graphdef_int32_int32_float32 --shape INPUT0:2,8,2 \ --shape INPUT1:2,8,2 -p2000 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -194,7 +204,8 @@ fi $PERF_ANALYZER -v -m plan_zero_1_float32 --input-data=$SHAPETENSORADTAFILE \ --shape DUMMY_INPUT0:4,4 -p2000 -b 8 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" @@ -206,73 +217,94 @@ if [ $(cat $CLIENT_LOG | grep ": 0 infer/sec\|: 0 usec" | wc -l) -ne 0 ]; then RET=1 fi -# TODO: Re-enable after sequence model support if fixed for CAPI -# $PERF_ANALYZER -v -m simple_savedmodel_sequence_object -p 2000 -t5 --sync \ -# --input-data=$SEQ_JSONDATAFILE \ -# --service-kind=triton_c_api --model-repository=$DATADIR \ -# --triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 -# if [ $? -ne 0 ]; then -# cat $CLIENT_LOG -# echo -e "\n***\n*** Test Failed\n***" -# RET=1 -# fi -# if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then -# cat $CLIENT_LOG -# echo -e "\n***\n*** Test Failed\n***" -# RET=1 -# fi -# -# TODO: Re-enable after variable model support if fixed for CAPI -# $PERF_ANALYZER -v -m graphdef_sequence_float32 --shape INPUT:2 \ -# --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE \ -# --input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE -p2000 \ -# --service-kind=triton_c_api --model-repository=$DATADIR \ -# --triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 -# if [ $? -eq 0 ]; then -# cat $CLIENT_LOG -# echo -e "\n***\n*** Test Failed\n***" -# RET=1 -# fi -# if [ $(cat $CLIENT_LOG | grep "Inputs to operation Select of type Select must have the same size and shape." | wc -l) -eq 0 ]; then -# cat $CLIENT_LOG -# echo -e "\n***\n*** Test Failed\n***" -# RET=1 -# fi - -#Testing that async does NOT work +$PERF_ANALYZER -v -m simple_savedmodel_sequence_object -p 2000 -t5 --sync \ +--input-data=$SEQ_JSONDATAFILE \ +--service-kind=triton_c_api --model-repository=$DATADIR \ +--triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi + +set +e +$PERF_ANALYZER -v -m graphdef_sequence_float32 --shape INPUT:2 \ +--input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE \ +--input-data=$FLOAT_DIFFSHAPE_JSONDATAFILE -p2000 \ +--service-kind=triton_c_api --model-repository=$DATADIR \ +--triton-server-directory=$SERVER_LIBRARY_PATH --sync >$CLIENT_LOG 2>&1 +if [ $? -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep -P "The supplied shape .+ is incompatible with the model's input shape" | wc -l) -eq 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +# Negative test for the async mode. 
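+# Async mode (-a) is expected to be rejected when --service-kind=triton_c_api
+# is used; the grep below requires exactly one "not supported by triton_c_api
+# service" message in the client log.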
set +e $PERF_ANALYZER -v -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 -a \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 -if [ $(cat $CLIENT_LOG | grep "${NON_SUPPORTED_ERROR_STRING}" | wc -l) -ne 1 ]; then +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 +if [ $(cat $CLIENT_LOG | grep "not supported by triton_c_api service" | wc -l) -ne 1 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi set -e -#Testing that shared memory does NOT work for SHARED_MEMORY_TYPE in system cuda; do - set +e $PERF_ANALYZER -v -m graphdef_int32_int32_int32 -t 1 -p2000 -b 1 \ --shared-memory=$SHARED_MEMORY_TYPE \ --service-kind=triton_c_api --model-repository=$DATADIR \ --triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 - if [ $(cat $CLIENT_LOG | grep "${NON_SUPPORTED_ERROR_STRING}" | wc -l) -ne 1 ]; then + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi + if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 fi - set -e done -# Testing --request-rate-range does NOT work -set +e $PERF_ANALYZER -v -m graphdef_int32_int32_int32 --request-rate-range 1000:2000:500 -p1000 -b 1 \ --service-kind=triton_c_api --model-repository=$DATADIR \ ---triton-server-directory=$SERVER_LIBRARY_PATH >$CLIENT_LOG 2>&1 -if [ $(cat $CLIENT_LOG | grep "${NON_SUPPORTED_ERROR_STRING}" | wc -l) -ne 1 ]; then +--triton-server-directory=$SERVER_LIBRARY_PATH -s ${STABILITY_THRESHOLD} \ +>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +if [ $(cat $CLIENT_LOG | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +set +e +# Testing erroneous configuration +# This model is expected to fail +$PERF_ANALYZER -v -m bls_undefined --shape INPUT0:1048576 -t 64\ +--service-kind=triton_c_api \ +--model-repository=$DATADIR --triton-server-directory=$SERVER_LIBRARY_PATH \ +-s ${STABILITY_THRESHOLD} >$CLIENT_LOG 2>&1 +if [ $? -ne 99 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 diff --git a/qa/L0_perf_analyzer_doc_links/mkdocs.yml b/qa/L0_perf_analyzer_doc_links/mkdocs.yml new file mode 100644 index 0000000000..41a4bfe485 --- /dev/null +++ b/qa/L0_perf_analyzer_doc_links/mkdocs.yml @@ -0,0 +1,36 @@ +# Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +site_name: CI Test +use_directory_urls: False +docs_dir: "./docs" +plugins: + - htmlproofer + - search + +markdown_extensions: + - toc: + permalink: True diff --git a/qa/L0_perf_analyzer_doc_links/test.sh b/qa/L0_perf_analyzer_doc_links/test.sh new file mode 100755 index 0000000000..c0c195cd18 --- /dev/null +++ b/qa/L0_perf_analyzer_doc_links/test.sh @@ -0,0 +1,73 @@ +#!/bin/bash +# Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +LOG="`pwd`/doc_links.log" +CONFIG="`pwd`/mkdocs.yml" +RET=0 + +# Download necessary packages +python3 -m pip install mkdocs +python3 -m pip install mkdocs-htmlproofer-plugin==0.10.3 + +#Download perf_analyzer docs +TRITON_CLIENT_REPO_TAG="${TRITON_CLIENT_REPO_TAG:=main}" +git clone -b ${TRITON_CLIENT_REPO_TAG} https://github.com/triton-inference-server/client.git +cp `pwd`/client/src/c++/perf_analyzer/README.md . +cp -rf `pwd`/client/src/c++/perf_analyzer/docs . + +# Need to remove all links that start with -- or -. Mkdocs converts all -- to - for anchor links. +# This breaks all links to cli commands throughout the docs. This will iterate over all +# files in the docs directory and remove -- and - at the start of options, which allows the +# tool to check links for correctness. +for file in `pwd`/docs/*.md +do + echo $file + sed -i 's/`-*/`/g' $file + sed -i 's/#-*/#/g' $file +done + +exec mkdocs serve -f $CONFIG > $LOG & +PID=$! +sleep 20 + +until [[ (-z `pgrep mkdocs`) ]]; do + kill -2 $PID + sleep 2 +done + +if [[ ! 
-z `grep "invalid url" $LOG` ]]; then + cat $LOG + RET=1 +fi + + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test PASSED\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi +exit $RET diff --git a/qa/L0_perf_analyzer_ground_truth/test.sh b/qa/L0_perf_analyzer_ground_truth/test.sh new file mode 100755 index 0000000000..d5d78e63f4 --- /dev/null +++ b/qa/L0_perf_analyzer_ground_truth/test.sh @@ -0,0 +1,175 @@ +#!/bin/bash +# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "${REPO_VERSION}" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +source ../common/util.sh + +# Setup client/perf_analyzer +CLIENT_LOG="./perf_analyzer.log" +PERF_ANALYZER=../clients/perf_analyzer + +function check_perf_analyzer_error { + ERROR_STRING="error | Request count: 0 | : 0 infer/sec" + CLIENT_RET="$1" + if [ ${CLIENT_RET} -ne 0 ]; then + cat ${CLIENT_LOG} + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi + if [ $(cat ${CLIENT_LOG} | grep "${ERROR_STRING}" | wc -l) -ne 0 ]; then + cat ${CLIENT_LOG} + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi +} + +# Checks that the model infer/sec performance is equal to an expected value +# +/- some tolerance. 
+# $1: csv result file from PA run +# $2: expected infer/sec value +# $3: tolerance for expected value equality +function check_performance { + # get the boundary values based on the tolerance percentage + MIN=$(python3 -c "print(${2} * (1 - ${3}))") + MAX=$(python3 -c "print(${2} * (1 + ${3}))") + + # delete all but the 2nd line in the resulting file + # then get the 2nd column value which is the infer/sec measurement + report_val=$(sed '2!d' $1 | awk -F ',' {'print $2'}) + + # check if within tolerance + ret=$(python3 -c "print(${report_val} >= ${MIN} and ${report_val} <= ${MAX})") + if [ "$ret" = "False" ]; then + echo -e "\n***\n*** Test Failed\n***" + RET=1 + fi +} + +# Iterate over the grpc results to ensure gRPC times are greater than 0 +# $1: client log file +# example line: Avg gRPC time: 42648 usec (marshal 6 usec + response wait 42640 usec + unmarshal 2 usec) +function check_grpc_time { + grep "gRPC" $1 | awk '{print $4}' | while read -r line; do + if [ $line -eq 0 ]; then + RET=1 + fi + done +} + +# Create input_data.json to communicate the requested model delay +# $1: desired model delay +function create_input_data { + echo "{\"data\":[{\"INPUT0\" : [${1}]}]}" > input_data.json +} + +# Setup server +export CUDA_VISIBLE_DEVICES=0 +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=`pwd`/models" +SERVER_LOG="./inference_server.log" + +rm -f $SERVER_LOG $CLIENT_LOG +MODEL_DIR="./models" +rm -fr ${MODEL_DIR} && mkdir ${MODEL_DIR} +MODELS="ground_truth" + +for model in ${MODELS}; do + # Add version directory to each model if non-existent + mkdir -p "${MODEL_DIR}/${model}/1" + cp ../python_models/${model}/model.py ./models/${model}/1/model.py + cp ../python_models/${model}/config.pbtxt ./models/${model}/config.pbtxt +done + +# Run server +run_server +if [ "${SERVER_PID}" == "0" ]; then + echo -e "\n***\n*** Failed to start ${SERVER}\n***" + cat ${SERVER_LOG} + exit 1 +fi + +# Run perf_analyzer +set +e +RET=0 +PROTOCOLS="http grpc" +OUTPUT_FILE="results" +MODEL_DELAYS=(0.05 0.5) +TOLERANCE="0.05" + +for model_delay in ${MODEL_DELAYS[@]}; do + create_input_data ${model_delay} + EXPECTED_RESULT=$(python3 -c "print(1 / ${model_delay})") + for protocol in ${PROTOCOLS}; do + for model in ${MODELS}; do + echo "================================================================" + echo "[PERMUTATION] Protocol=${protocol} Model=${model}" + echo "================================================================" + + ${PERF_ANALYZER} -v -i ${protocol} --concurrency-range 2 --input-data input_data.json -m ${model} -f ${OUTPUT_FILE} | tee ${CLIENT_LOG} 2>&1 + check_perf_analyzer_error $? 
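+            # check_performance below compares the infer/sec column of ${OUTPUT_FILE}
+            # against EXPECTED_RESULT (1/${model_delay}) within the ${TOLERANCE} window.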
+ + check_performance ${OUTPUT_FILE} ${EXPECTED_RESULT} ${TOLERANCE} + + if [ "${protocol}" == "grpc" ]; then + check_grpc_time ${CLIENT_LOG} + fi + done; + done; +done; + + +set -e + +# Cleanup +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo "=== START SERVER LOG ===" + cat ${SERVER_LOG} + echo "=== END SERVER LOG ===" + echo "=== START CLIENT LOG ===" + cat ${CLIENT_LOG} + echo "=== END CLIENT LOG ===" + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit ${RET} diff --git a/qa/L0_perf_analyzer_report/test.sh b/qa/L0_perf_analyzer_report/test.sh index b820bd019e..7a04905842 100755 --- a/qa/L0_perf_analyzer_report/test.sh +++ b/qa/L0_perf_analyzer_report/test.sh @@ -125,14 +125,14 @@ done sed -i "s/${COMPOSING_MODEL}/${COMPOSING_MODEL_CACHE_ENABLED}/g" "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_ENABLED}/config.pbtxt" sed -i "s/${COMPOSING_MODEL}/${COMPOSING_MODEL_CACHE_DISABLED}/g" "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_DISABLED}/config.pbtxt" -## Append cache config to each model config -echo "response_cache { enable: True }" >> "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_ENABLED}/config.pbtxt" -echo "response_cache { enable: False }" >> "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_DISABLED}/config.pbtxt" -echo "response_cache { enable: True }" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_ENABLED}/config.pbtxt" -echo "response_cache { enable: False }" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_DISABLED}/config.pbtxt" +## Append cache config to each model config +echo -e "response_cache { enable: True }" >> "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_ENABLED}/config.pbtxt" +echo -e "response_cache { enable: False }" >> "${MODEL_DIR}/${ENSEMBLE_MODEL_CACHE_DISABLED}/config.pbtxt" +echo -e "response_cache { enable: True }" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_ENABLED}/config.pbtxt" +echo -e "response_cache { enable: False }" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_DISABLED}/config.pbtxt" # Force CPU memory for composing models since cache doesn't currently support GPU memory -echo "instance_group [{ kind: KIND_CPU \n count: 1 }]" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_ENABLED}/config.pbtxt" -echo "instance_group [{ kind: KIND_CPU \n count: 1 }]" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_DISABLED}/config.pbtxt" +echo -e "instance_group [{ kind: KIND_CPU, count: 1 }]" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_ENABLED}/config.pbtxt" +echo -e "instance_group [{ kind: KIND_CPU, count: 1 }]" >> "${MODEL_DIR}/${COMPOSING_MODEL_CACHE_DISABLED}/config.pbtxt" # Run server run_server @@ -146,13 +146,14 @@ fi set +e RET=0 PROTOCOLS="http grpc" +STABILITY_THRESHOLD="15" for protocol in ${PROTOCOLS}; do for model in ${MODELS}; do echo "================================================================" echo "[PERMUTATION] Protocol=${protocol} Model=${model}" echo "================================================================" - ${PERF_ANALYZER} -v -i ${protocol} -m ${model} | tee ${CLIENT_LOG} 2>&1 + ${PERF_ANALYZER} -v -i ${protocol} -m ${model} -s ${STABILITY_THRESHOLD} | tee ${CLIENT_LOG} 2>&1 check_perf_analyzer_error $? # Check response cache outputs diff --git a/qa/L0_perf_deeprecommender/run_test.sh b/qa/L0_perf_deeprecommender/run_test.sh index ca5fa8e27c..2fb74eadfc 100755 --- a/qa/L0_perf_deeprecommender/run_test.sh +++ b/qa/L0_perf_deeprecommender/run_test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,12 +28,13 @@ STATIC_BATCH_SIZES=${STATIC_BATCH_SIZES:=1} DYNAMIC_BATCH_SIZES=${DYNAMIC_BATCH_SIZES:=1} INSTANCE_COUNTS=${INSTANCE_COUNTS:=1} +TF_VERSION=${TF_VERSION:=2} PERF_CLIENT=../clients/perf_client REPORTER=../common/reporter.py SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=`pwd`/models" +SERVER_ARGS="--model-repository=`pwd`/models --backend-config=tensorflow,version=${TF_VERSION}" source ../common/util.sh # Select the single GPU that will be available to the inference @@ -69,7 +70,8 @@ for STATIC_BATCH in $STATIC_BATCH_SIZES; do echo "dynamic_batching { preferred_batch_size: [ ${DYNAMIC_BATCH} ] }" >> config.pbtxt) fi - SERVER_LOG="${NAME}.serverlog" + echo "Time before starting server: $(date)" + SERVER_LOG="${NAME}.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -78,6 +80,7 @@ for STATIC_BATCH in $STATIC_BATCH_SIZES; do fi set +e + echo "Time before perf analyzer trials: $(date)" # Run the model once to warm up. Some frameworks do # optimization on the first requests. Must warmup similar @@ -85,14 +88,22 @@ for STATIC_BATCH in $STATIC_BATCH_SIZES; do $PERF_CLIENT -v -i ${PERF_CLIENT_PROTOCOL} -m $MODEL_NAME -p5000 \ -b${STATIC_BATCH} --concurrency-range ${CONCURRENCY} + set -o pipefail + PA_MAX_TRIALS=${PA_MAX_TRIALS:-"50"} $PERF_CLIENT -v -i ${PERF_CLIENT_PROTOCOL} -m $MODEL_NAME -p5000 \ -b${STATIC_BATCH} --concurrency-range ${CONCURRENCY} \ + --max-trials "${PA_MAX_TRIALS}" \ -f ${NAME}.csv 2>&1 | tee ${NAME}.log if (( $? != 0 )); then + echo -e "\n***\n*** FAILED Perf Analyzer measurement\n***" RET=1 fi + echo "Time after perf analyzer trials: $(date)" + set +o pipefail + curl localhost:8002/metrics -o ${NAME}.metrics >> ${NAME}.log 2>&1 if (( $? 
!= 0 )); then + echo -e "\n***\n*** FAILED to get metrics\n***" RET=1 fi diff --git a/qa/L0_perf_deeprecommender/test.sh b/qa/L0_perf_deeprecommender/test.sh index 2c528794af..3048e46cf5 100755 --- a/qa/L0_perf_deeprecommender/test.sh +++ b/qa/L0_perf_deeprecommender/test.sh @@ -43,7 +43,7 @@ TRTEXEC=/usr/src/tensorrt/bin/trtexec MODEL="deeprecommender" PROTOCOLS="grpc http" -rm -f *.log *.serverlog *.csv *.metrics *.tjson *.json +rm -f *.log *.csv *.metrics *.tjson *.json # # Test minimum latency @@ -58,6 +58,7 @@ rm -fr tensorrt_models && mkdir tensorrt_models (cd tensorrt_models/deeprecommender_plan && \ sed -i "s/^name:.*/name: \"deeprecommender_plan\"/" config.pbtxt && \ sed -i "s/tensorflow_graphdef/tensorrt_plan/" config.pbtxt && \ + sed -i "s/max_batch_size:.*/max_batch_size: ${STATIC_BATCH}/" config.pbtxt && \ sed -i "s/\[17736\]/\[17736,1,1\]/" config.pbtxt) $TRTEXEC --uff=$REPODIR/perf_model_store/deeprecommender_graphdef/deeprecommender_graphdef.uff \ @@ -117,6 +118,7 @@ rm -fr tensorrt_models && mkdir tensorrt_models (cd tensorrt_models/deeprecommender_plan && \ sed -i "s/^name:.*/name: \"deeprecommender_plan\"/" config.pbtxt && \ sed -i "s/tensorflow_graphdef/tensorrt_plan/" config.pbtxt && \ + sed -i "s/max_batch_size:.*/max_batch_size: ${STATIC_BATCH}/" config.pbtxt && \ sed -i "s/\[17736\]/\[17736,1,1\]/" config.pbtxt) $TRTEXEC --uff=$REPODIR/perf_model_store/deeprecommender_graphdef/deeprecommender_graphdef.uff \ diff --git a/qa/L0_perf_kaldi/create_data.sh b/qa/L0_perf_kaldi/create_data.sh old mode 100644 new mode 100755 index 68b32a4099..849b56d906 --- a/qa/L0_perf_kaldi/create_data.sh +++ b/qa/L0_perf_kaldi/create_data.sh @@ -25,7 +25,7 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -# Needs to be run in asr_kaldi main directory and must be copied to +# Needs to be run in asr_kaldi main directory and must be copied to # draco for benchmark test TRITON_VERSION="20.05" diff --git a/qa/L0_perf_kaldi/test.sh b/qa/L0_perf_kaldi/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_perf_nomodel/run_test.sh b/qa/L0_perf_nomodel/run_test.sh index 8e79f82550..b1e2702ecb 100755 --- a/qa/L0_perf_nomodel/run_test.sh +++ b/qa/L0_perf_nomodel/run_test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -48,15 +48,14 @@ ARCH=${ARCH:="x86_64"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends MODEL_REPO="${PWD}/models" -SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR}" +PERF_CLIENT=../clients/perf_client +TF_VERSION=${TF_VERSION:=2} +SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION}" source ../common/util.sh # DATADIR is already set in environment variable for aarch64 -if [ "$ARCH" == "aarch64" ]; then - PERF_CLIENT=${TRITON_DIR}/clients/bin/perf_client -else - PERF_CLIENT=../clients/perf_client - DATADIR=/data/inferenceserver/${REPO_VERSION} +if [ "$ARCH" != "aarch64" ]; then + DATADIR="/data/inferenceserver/${REPO_VERSION}" fi # Select the single GPU that will be available to the inference server @@ -75,12 +74,16 @@ if [[ $BACKENDS == *"python"* ]]; then sed -i "s/^name:.*/name: \"python_zero_1_float32\"/" config.pbtxt) fi +if [[ $BACKENDS == *"custom"* ]]; then + mkdir -p "custom_models/custom_zero_1_float32/1" +fi + PERF_CLIENT_PERCENTILE_ARGS="" && (( ${PERF_CLIENT_PERCENTILE} != 0 )) && PERF_CLIENT_PERCENTILE_ARGS="--percentile=${PERF_CLIENT_PERCENTILE}" -PERF_CLIENT_EXTRA_ARGS="$PERF_CLIENT_PERCENTILE_ARGS --shared-memory \"${SHARED_MEMORY}\"" +PERF_CLIENT_EXTRA_ARGS="$PERF_CLIENT_PERCENTILE_ARGS --shared-memory ${SHARED_MEMORY}" -# Overload use of PERF_CLIENT_PROTOCOL for convenience with existing test and +# Overload use of PERF_CLIENT_PROTOCOL for convenience with existing test and # reporting structure, though "triton_c_api" is not strictly a "protocol". if [[ "${PERF_CLIENT_PROTOCOL}" == "triton_c_api" ]]; then # Server will be run in-process with C API @@ -166,9 +169,10 @@ for BACKEND in $BACKENDS; do echo "dynamic_batching { preferred_batch_size: [ ${DYNAMIC_BATCH} ] }" >> config.pbtxt) fi + echo "Time before starting server: $(date)" # Only start separate server if not using C API, since C API runs server in-process if [[ "${PERF_CLIENT_PROTOCOL}" != "triton_c_api" ]]; then - SERVER_LOG="${RESULTDIR}/${NAME}.serverlog" + SERVER_LOG="${RESULTDIR}/${NAME}.server.log" run_server if [ $SERVER_PID == 0 ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -177,19 +181,26 @@ for BACKEND in $BACKENDS; do fi fi + echo "Time before perf analyzer trials: $(date)" set +e + set -o pipefail + PA_MAX_TRIALS=${PA_MAX_TRIALS:-"50"} $PERF_CLIENT -v \ -p${PERF_CLIENT_STABILIZE_WINDOW} \ -s${PERF_CLIENT_STABILIZE_THRESHOLD} \ ${PERF_CLIENT_EXTRA_ARGS} \ -m ${MODEL_NAME} \ -b${STATIC_BATCH} -t${CONCURRENCY} \ + --max-trials "${PA_MAX_TRIALS}" \ --shape ${INPUT_NAME}:${SHAPE} \ ${SERVICE_ARGS} \ -f ${RESULTDIR}/${NAME}.csv 2>&1 | tee ${RESULTDIR}/${NAME}.log if [ $? -ne 0 ]; then + echo -e "\n***\n*** FAILED Perf Analyzer measurement\n***" RET=1 fi + echo "Time after perf analyzer trials: $(date)" + set +o pipefail set -e echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${RESULTDIR}/${NAME}.tjson diff --git a/qa/L0_perf_nomodel/test.sh b/qa/L0_perf_nomodel/test.sh index 7f1051106a..6ff68303ed 100755 --- a/qa/L0_perf_nomodel/test.sh +++ b/qa/L0_perf_nomodel/test.sh @@ -38,7 +38,7 @@ if [ ! 
-z "$TEST_REPO_ARCH" ]; then REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} fi -rm -f *.log *.serverlog *.csv *.tjson *.json +rm -f *.log *.csv *.tjson *.json # Descriptive name for the current results UNDERTEST_NAME=${NVIDIA_TRITON_SERVER_VERSION} @@ -55,12 +55,12 @@ PERF_CLIENT_SLOWDOWN_THRESHOLD=5.0 # Length of window, in milliseconds, to use when stabilizing latency # and infer/sec results. -PERF_CLIENT_STABILIZE_WINDOW=5000 +PERF_CLIENT_STABILIZE_WINDOW=10000 # Threshold, as a percentage, to use when stabilizing latency and # infer/sec results. Values must vary by less than this percent over 3 # measurement windows to be considered value. -PERF_CLIENT_STABILIZE_THRESHOLD=5.0 +PERF_CLIENT_STABILIZE_THRESHOLD=15.0 RUNTEST=./run_test.sh diff --git a/qa/L0_perf_pyclients/simple_perf_client.py b/qa/L0_perf_pyclients/simple_perf_client.py old mode 100644 new mode 100755 index 00e1ea5427..fd02f94887 --- a/qa/L0_perf_pyclients/simple_perf_client.py +++ b/qa/L0_perf_pyclients/simple_perf_client.py @@ -26,13 +26,13 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np +import sys import time +import numpy as np import tritonclient.grpc as grpcclient import tritonclient.http as httpclient -from tritonclient.utils import triton_to_np_dtype -from tritonclient.utils import InferenceServerException +from tritonclient.utils import InferenceServerException, triton_to_np_dtype FLAGS = None @@ -43,47 +43,59 @@ def parse_model_grpc(model_metadata, model_config): by this client. """ if len(model_metadata.inputs) != 1: - raise Exception("expecting 1 input, got {}".format( - len(model_metadata.inputs))) + raise Exception("expecting 1 input, got {}".format(len(model_metadata.inputs))) if len(model_metadata.outputs) != 1: - raise Exception("expecting 1 output, got {}".format( - len(model_metadata.outputs))) + raise Exception( + "expecting 1 output, got {}".format(len(model_metadata.outputs)) + ) if len(model_config.input) != 1: raise Exception( "expecting 1 input in model configuration, got {}".format( - len(model_config.input))) + len(model_config.input) + ) + ) input_metadata = model_metadata.inputs[0] output_metadata = model_metadata.outputs[0] - batch_dim = (model_config.max_batch_size > 0) + batch_dim = model_config.max_batch_size > 0 expected_dims = 1 + (1 if batch_dim else 0) if len(input_metadata.shape) != expected_dims: raise Exception( - "expecting input to have {} dimensions, model '{}' input has {}". - format(expected_dims, model_metadata.name, - len(input_metadata.shape))) + "expecting input to have {} dimensions, model '{}' input has {}".format( + expected_dims, model_metadata.name, len(input_metadata.shape) + ) + ) if len(output_metadata.shape) != expected_dims: raise Exception( - "expecting output to have {} dimensions, model '{}' output has {}". 
- format(expected_dims, model_metadata.name, - len(output_metadata.shape))) + "expecting output to have {} dimensions, model '{}' output has {}".format( + expected_dims, model_metadata.name, len(output_metadata.shape) + ) + ) if input_metadata.shape[-1] != -1: raise Exception( - "expecting input to have variable shape [-1], model '{}' input has {}" - .format(model_metadata.name, input_metadata.shape)) + "expecting input to have variable shape [-1], model '{}' input has {}".format( + model_metadata.name, input_metadata.shape + ) + ) if output_metadata.shape[-1] != -1: raise Exception( - "expecting output to have variable shape [-1], model '{}' output has {}" - .format(model_metadata.name, output_metadata.shape)) + "expecting output to have variable shape [-1], model '{}' output has {}".format( + model_metadata.name, output_metadata.shape + ) + ) - return (model_config.max_batch_size, input_metadata.name, - output_metadata.name, input_metadata.datatype) + return ( + model_config.max_batch_size, + input_metadata.name, + output_metadata.name, + input_metadata.datatype, + ) def parse_model_http(model_metadata, model_config): @@ -91,151 +103,176 @@ def parse_model_http(model_metadata, model_config): Check the configuration of a model to make sure it is supported by this client. """ - if len(model_metadata['inputs']) != 1: - raise Exception("expecting 1 input, got {}".format( - len(model_metadata['inputs']))) - if len(model_metadata['outputs']) != 1: - raise Exception("expecting 1 output, got {}".format( - len(model_metadata['outputs']))) - - if len(model_config['input']) != 1: + if len(model_metadata["inputs"]) != 1: + raise Exception( + "expecting 1 input, got {}".format(len(model_metadata["inputs"])) + ) + if len(model_metadata["outputs"]) != 1: + raise Exception( + "expecting 1 output, got {}".format(len(model_metadata["outputs"])) + ) + + if len(model_config["input"]) != 1: raise Exception( "expecting 1 input in model configuration, got {}".format( - len(model_config['input']))) + len(model_config["input"]) + ) + ) - input_metadata = model_metadata['inputs'][0] - output_metadata = model_metadata['outputs'][0] + input_metadata = model_metadata["inputs"][0] + output_metadata = model_metadata["outputs"][0] max_batch_size = 0 - if 'max_batch_size' in model_config: - max_batch_size = model_config['max_batch_size'] + if "max_batch_size" in model_config: + max_batch_size = model_config["max_batch_size"] - batch_dim = (max_batch_size > 0) + batch_dim = max_batch_size > 0 expected_dims = 1 + (1 if batch_dim else 0) - if len(input_metadata['shape']) != expected_dims: + if len(input_metadata["shape"]) != expected_dims: raise Exception( - "expecting input to have {} dimensions, model '{}' input has {}". - format(expected_dims, model_metadata.name, - len(input_metadata['shape']))) + "expecting input to have {} dimensions, model '{}' input has {}".format( + expected_dims, model_metadata.name, len(input_metadata["shape"]) + ) + ) - if len(output_metadata['shape']) != expected_dims: + if len(output_metadata["shape"]) != expected_dims: raise Exception( - "expecting output to have {} dimensions, model '{}' output has {}". 
- format(expected_dims, model_metadata.name, - len(output_metadata['shape']))) + "expecting output to have {} dimensions, model '{}' output has {}".format( + expected_dims, model_metadata.name, len(output_metadata["shape"]) + ) + ) - if input_metadata['shape'][-1] != -1: + if input_metadata["shape"][-1] != -1: raise Exception( - "expecting input to have variable shape [-1], model '{}' input has {}" - .format(model_metadata.name, input_metadata['shape'])) + "expecting input to have variable shape [-1], model '{}' input has {}".format( + model_metadata.name, input_metadata["shape"] + ) + ) - if output_metadata['shape'][-1] != -1: + if output_metadata["shape"][-1] != -1: raise Exception( - "expecting output to have variable shape [-1], model '{}' output has {}" - .format(model_metadata.name, output_metadata['shape'])) + "expecting output to have variable shape [-1], model '{}' output has {}".format( + model_metadata.name, output_metadata["shape"] + ) + ) - return (max_batch_size, input_metadata['name'], output_metadata['name'], - input_metadata['datatype']) + return ( + max_batch_size, + input_metadata["name"], + output_metadata["name"], + input_metadata["datatype"], + ) def requestGenerator(input_name, input_data, output_name, dtype, protocol): - # Set the input data inputs = [] if protocol.lower() == "grpc": - inputs.append(grpcclient.InferInput(input_name, input_data.shape, - dtype)) + inputs.append(grpcclient.InferInput(input_name, input_data.shape, dtype)) inputs[0].set_data_from_numpy(input_data) else: - inputs.append(httpclient.InferInput(input_name, input_data.shape, - dtype)) + inputs.append(httpclient.InferInput(input_name, input_data.shape, dtype)) inputs[0].set_data_from_numpy(input_data, binary_data=True) outputs = [] if protocol.lower() == "grpc": outputs.append(grpcclient.InferRequestedOutput(output_name)) else: - outputs.append( - httpclient.InferRequestedOutput(output_name, binary_data=True)) + outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True)) return inputs, outputs -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-m', - '--model-name', - type=str, - required=True, - help='Name of model') parser.add_argument( - '-x', - '--model-version', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-m", "--model-name", type=str, required=True, help="Name of model" + ) + parser.add_argument( + "-x", + "--model-version", type=str, required=False, default="", - help='Version of model. Default is to use latest version.') - parser.add_argument('-b', - '--batch-size', - type=int, - required=False, - default=1, - help='Batch size. Default is 1.') - parser.add_argument('-s', - '--shape', - type=int, - required=False, - default=1, - help='The shape of the tensor. Default is 1.') - parser.add_argument('-u', - '--url', - type=str, - required=False, - default='localhost:8000', - help='Inference server URL. Default is localhost:8000.') - parser.add_argument('-i', - '--protocol', - type=str, - required=False, - default='HTTP', - help='Protocol (HTTP/gRPC) used to communicate with ' + - 'the inference service. Default is HTTP.') - parser.add_argument('-c', - '--iteration_count', - type=int, - required=False, - default=1000, - help='The number of iterations. 
Default is 1000.') + help="Version of model. Default is to use latest version.", + ) parser.add_argument( - '-w', - '--warmup_count', + "-b", + "--batch-size", + type=int, + required=False, + default=1, + help="Batch size. Default is 1.", + ) + parser.add_argument( + "-s", + "--shape", + type=int, + required=False, + default=1, + help="The shape of the tensor. Default is 1.", + ) + parser.add_argument( + "-u", + "--url", + type=str, + required=False, + default="localhost:8000", + help="Inference server URL. Default is localhost:8000.", + ) + parser.add_argument( + "-i", + "--protocol", + type=str, + required=False, + default="HTTP", + help="Protocol (HTTP/gRPC) used to communicate with " + + "the inference service. Default is HTTP.", + ) + parser.add_argument( + "-c", + "--iteration_count", + type=int, + required=False, + default=1000, + help="The number of iterations. Default is 1000.", + ) + parser.add_argument( + "-w", + "--warmup_count", type=int, required=False, default=500, - help='The number of warm-up iterations. Default is 500.') + help="The number of warm-up iterations. Default is 500.", + ) parser.add_argument( - '--csv', + "--csv", type=str, required=False, default=None, - help='The name of the file to store the results in CSV format') + help="The name of the file to store the results in CSV format", + ) FLAGS = parser.parse_args() try: if FLAGS.protocol.lower() == "grpc": # Create gRPC client for communicating with the server triton_client = grpcclient.InferenceServerClient( - url=FLAGS.url, verbose=FLAGS.verbose) + url=FLAGS.url, verbose=FLAGS.verbose + ) else: triton_client = httpclient.InferenceServerClient( - url=FLAGS.url, verbose=FLAGS.verbose, concurrency=1) + url=FLAGS.url, verbose=FLAGS.verbose, concurrency=1 + ) except Exception as e: print("client creation failed: " + str(e)) sys.exit(1) @@ -244,7 +281,8 @@ def requestGenerator(input_name, input_data, output_name, dtype, protocol): # properties of the model that we need for preprocessing try: model_metadata = triton_client.get_model_metadata( - model_name=FLAGS.model_name, model_version=FLAGS.model_version) + model_name=FLAGS.model_name, model_version=FLAGS.model_version + ) except InferenceServerException as e: print("failed to retrieve the metadata: " + str(e)) sys.exit(1) @@ -253,36 +291,41 @@ def requestGenerator(input_name, input_data, output_name, dtype, protocol): # properties of the model that we need for preprocessing try: model_metadata = triton_client.get_model_metadata( - model_name=FLAGS.model_name, model_version=FLAGS.model_version) + model_name=FLAGS.model_name, model_version=FLAGS.model_version + ) except InferenceServerException as e: print("failed to retrieve the metadata: " + str(e)) sys.exit(1) try: model_config = triton_client.get_model_config( - model_name=FLAGS.model_name, model_version=FLAGS.model_version) + model_name=FLAGS.model_name, model_version=FLAGS.model_version + ) except InferenceServerException as e: print("failed to retrieve the config: " + str(e)) sys.exit(1) if FLAGS.protocol.lower() == "grpc": max_batch_size, input_name, output_name, dtype = parse_model_grpc( - model_metadata, model_config.config) + model_metadata, model_config.config + ) else: max_batch_size, input_name, output_name, dtype = parse_model_http( - model_metadata, model_config) + model_metadata, model_config + ) - input_data = np.zeros([FLAGS.batch_size, FLAGS.shape], - dtype=triton_to_np_dtype(dtype)) + input_data = np.zeros( + [FLAGS.batch_size, FLAGS.shape], dtype=triton_to_np_dtype(dtype) + ) # 
--------------------------- Warm-Up -------------------------------------------------------- for i in range(FLAGS.warmup_count): - inputs, outputs = requestGenerator(input_name, input_data, output_name, - dtype, FLAGS.protocol.lower()) - triton_client.infer(FLAGS.model_name, - inputs, - model_version=FLAGS.model_version, - outputs=outputs) + inputs, outputs = requestGenerator( + input_name, input_data, output_name, dtype, FLAGS.protocol.lower() + ) + triton_client.infer( + FLAGS.model_name, inputs, model_version=FLAGS.model_version, outputs=outputs + ) latencies = [] @@ -292,12 +335,12 @@ def requestGenerator(input_name, input_data, output_name, dtype, protocol): for i in range(FLAGS.iteration_count): t0 = time.time() - inputs, outputs = requestGenerator(input_name, input_data, output_name, - dtype, FLAGS.protocol.lower()) - triton_client.infer(FLAGS.model_name, - inputs, - model_version=FLAGS.model_version, - outputs=outputs) + inputs, outputs = requestGenerator( + input_name, input_data, output_name, dtype, FLAGS.protocol.lower() + ) + triton_client.infer( + FLAGS.model_name, inputs, model_version=FLAGS.model_version, outputs=outputs + ) latencies.append(time.time() - t0) end_time = time.time() @@ -320,12 +363,17 @@ def requestGenerator(input_name, input_data, output_name, dtype, protocol): # --------------------------- Write CSV -------------------------------------------------------- if FLAGS.csv != None: - file = open(FLAGS.csv, 'w') + file = open(FLAGS.csv, "w") file.write( "Concurrency,Inferences/Second,p50 latency,p90 latency,p95 latency,p99 latency\n" ) - file.write("1,{},{},{},{},{}".format(throughput, p50_latency * 1000, - p90_latency * 1000, - p95_latency * 1000, - p99_latency * 1000)) + file.write( + "1,{},{},{},{},{}".format( + throughput, + p50_latency * 1000, + p90_latency * 1000, + p95_latency * 1000, + p99_latency * 1000, + ) + ) file.close() diff --git a/qa/L0_perf_pyclients/test.sh b/qa/L0_perf_pyclients/test.sh index 57350a512c..9b7e405977 100755 --- a/qa/L0_perf_pyclients/test.sh +++ b/qa/L0_perf_pyclients/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -43,8 +43,10 @@ REPORTER=../common/reporter.py CLIENT_LOG="./simple_perf_client.log" SIMPLE_PERF_CLIENT=simple_perf_client.py +TF_VERSION=${TF_VERSION:=2} + SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=`pwd`/custom_models" +SERVER_ARGS="--model-repository=`pwd`/custom_models --backend-config=tensorflow,version=${TF_VERSION}" source ../common/util.sh # Select the single GPU that will be available to the inference diff --git a/qa/L0_perf_resnet/run_test.sh b/qa/L0_perf_resnet/run_test.sh index 953aab71d3..579d00c0e5 100755 --- a/qa/L0_perf_resnet/run_test.sh +++ b/qa/L0_perf_resnet/run_test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
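The TF_VERSION plumbing added above all funnels into a single tritonserver flag. A minimal sketch of the resulting launch, assuming the default /opt/tritonserver layout and the models/ repository these tests build in the working directory:

    # TF_VERSION defaults to 2 in the scripts above; it selects the TensorFlow
    # version used by the TensorFlow backend via --backend-config.
    TF_VERSION=2
    /opt/tritonserver/bin/tritonserver \
        --model-repository=$(pwd)/models \
        --backend-directory=/opt/tritonserver/backends \
        --backend-config=tensorflow,version=${TF_VERSION}
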
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,6 +28,7 @@ STATIC_BATCH=${STATIC_BATCH:=1} INSTANCE_CNT=${INSTANCE_CNT:=1} BACKEND_CONFIG=${BACKEND_CONFIG:=""} +TF_VERSION=${TF_VERSION:=2} REPORTER=../common/reporter.py @@ -35,7 +36,7 @@ TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends MODEL_REPO="${PWD}/models" -SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR} ${BACKEND_CONFIG}" +SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR} ${BACKEND_CONFIG} --backend-config=tensorflow,version=${TF_VERSION}" source ../common/util.sh # Select the single GPU that will be available to the inference @@ -53,20 +54,16 @@ rm -fr models && mkdir -p models && \ sed -i "s/^max_batch_size:.*/max_batch_size: ${MAX_BATCH}/" config.pbtxt && \ echo "instance_group [ { count: ${INSTANCE_CNT} }]") -# Onnx and onnx-trt models are very slow on Jetson. MEASUREMENT_WINDOW=5000 +PERF_CLIENT=../clients/perf_client +# Onnx and onnx-trt models are very slow on Jetson. if [ "$ARCH" == "aarch64" ]; then - PERF_CLIENT=${TRITON_DIR}/clients/bin/perf_client if [ "$MODEL_FRAMEWORK" == "onnx" ] || [ "$MODEL_FRAMEWORK" == "onnx_trt" ]; then MEASUREMENT_WINDOW=20000 fi -else - PERF_CLIENT=../clients/perf_client fi -set +e - -# Overload use of PERF_CLIENT_PROTOCOL for convenience with existing test and +# Overload use of PERF_CLIENT_PROTOCOL for convenience with existing test and # reporting structure, though "triton_c_api" is not strictly a "protocol". if [[ "${PERF_CLIENT_PROTOCOL}" == "triton_c_api" ]]; then # Server will be run in-process with C API @@ -76,7 +73,7 @@ if [[ "${PERF_CLIENT_PROTOCOL}" == "triton_c_api" ]]; then else SERVICE_ARGS="-i ${PERF_CLIENT_PROTOCOL}" - SERVER_LOG="${NAME}.serverlog" + SERVER_LOG="${NAME}.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -88,19 +85,27 @@ else # Must warmup similar to actual run so that all instances are ready # Note: Running extra PA for warmup doesn't make sense for C API since it # uses in-process tritonserver which will exit along with this PA process. + set +e $PERF_CLIENT -v -m $MODEL_NAME -p${MEASUREMENT_WINDOW} \ -b${STATIC_BATCH} --concurrency-range ${CONCURRENCY} \ ${SERVICE_ARGS} + set -e fi +set +e +set -o pipefail +PA_MAX_TRIALS=${PA_MAX_TRIALS:-"50"} # Measure perf client results and write them to a file for reporting $PERF_CLIENT -v -m $MODEL_NAME -p${MEASUREMENT_WINDOW} \ -b${STATIC_BATCH} --concurrency-range ${CONCURRENCY} \ + --max-trials "${PA_MAX_TRIALS}" \ ${SERVICE_ARGS} \ -f ${NAME}.csv 2>&1 | tee ${NAME}.log if (( $? != 0 )); then + echo -e "\n***\n*** FAILED Perf Analyzer measurement\n***" RET=1 fi +set +o pipefail set -e echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${NAME}.tjson diff --git a/qa/L0_perf_resnet/test.sh b/qa/L0_perf_resnet/test.sh index afdc4911d2..93b946ec35 100755 --- a/qa/L0_perf_resnet/test.sh +++ b/qa/L0_perf_resnet/test.sh @@ -38,7 +38,7 @@ if [ ! 
-z "$TEST_REPO_ARCH" ]; then REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} fi -rm -f *.log *.serverlog *.csv *.tjson *.json +rm -f *.log *.csv *.tjson *.json PROTOCOLS="grpc http triton_c_api" @@ -110,7 +110,8 @@ done rm -fr tensorrt_models && mkdir tensorrt_models cp -r $REPODIR/caffe_models/trt_model_store/resnet50_plan tensorrt_models/${TRT_MODEL_NAME} && \ (cd tensorrt_models/${TRT_MODEL_NAME} && \ - sed -i "s/^name:.*/name: \"${TRT_MODEL_NAME}\"/" config.pbtxt) && \ + sed -i "s/^name:.*/name: \"${TRT_MODEL_NAME}\"/" config.pbtxt && \ + sed -i "s/max_batch_size:.*/max_batch_size: ${STATIC_BATCH}/" config.pbtxt) && \ mkdir -p tensorrt_models/${TRT_MODEL_NAME}/1 $CAFFE2PLAN -h -b ${STATIC_BATCH} \ -n prob -o tensorrt_models/${TRT_MODEL_NAME}/1/model.plan \ @@ -167,7 +168,8 @@ CONCURRENCY=4 rm -fr tensorrt_models && mkdir tensorrt_models cp -r $REPODIR/caffe_models/trt_model_store/resnet50_plan tensorrt_models/${TRT_MODEL_NAME} && \ (cd tensorrt_models/${TRT_MODEL_NAME} && \ - sed -i "s/^name:.*/name: \"${TRT_MODEL_NAME}\"/" config.pbtxt) && \ + sed -i "s/^name:.*/name: \"${TRT_MODEL_NAME}\"/" config.pbtxt && \ + sed -i "s/max_batch_size:.*/max_batch_size: ${STATIC_BATCH}/" config.pbtxt) && \ mkdir -p tensorrt_models/${TRT_MODEL_NAME}/1 $CAFFE2PLAN -h -b ${STATIC_BATCH} \ -n prob -o tensorrt_models/${TRT_MODEL_NAME}/1/model.plan \ diff --git a/qa/L0_perf_tfs/test.sh b/qa/L0_perf_tfs/test.sh deleted file mode 100755 index 9d44d241c1..0000000000 --- a/qa/L0_perf_tfs/test.sh +++ /dev/null @@ -1,153 +0,0 @@ -#!/bin/bash -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY -# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY -# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -if [ "$#" -ge 1 ]; then - REPO_VERSION=$1 -fi -if [ -z "$REPO_VERSION" ]; then - echo -e "Repository version must be specified" - echo -e "\n***\n*** Test Failed\n***" - exit 1 -fi -if [ ! 
-z "$TEST_REPO_ARCH" ]; then - REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} -fi - -apt update - # needed by perf_analyzer -apt install -y libb64-dev -# needed by reporter -apt install -y python3 python3-pip python3-dev -rm -f /usr/bin/python && ln -s /usr/bin/python3 /usr/bin/python -pip3 install --upgrade requests - -REPODIR=/data/inferenceserver/${REPO_VERSION} -rm -f *.log *.csv *.tjson *.json -rm -rf model_store - -RET=0 - -# Create model_store -MODEL_NAME="resnet50v1.5_fp16_savedmodel" -mkdir model_store -mkdir -p model_store/${MODEL_NAME} -cp -r ${REPODIR}/perf_model_store/${MODEL_NAME}/1/model.savedmodel model_store/${MODEL_NAME}/1 - -# Run server -tensorflow_model_server --port=8500 --model_name=${MODEL_NAME} --model_base_path=$PWD/model_store/${MODEL_NAME} > server.log 2>&1 & -SERVER_PID=$! -# Wait for the server to start -sleep 10 -if [ "$SERVER_PID" == "0" ]; then - echo -e "\n***\n*** Failed to start server\n***" - cat server.log - exit 1 -fi - -PERF_ANALYZER=/perf_bin/perf_analyzer -REPORTER=../common/reporter.py - -# To get the minimum latency -STATIC_BATCH=1 -NAME=${MODEL_NAME}_sbatch${STATIC_BATCH} - -# Run client -# To warmup the model -$PERF_ANALYZER -m ${MODEL_NAME} --service-kind tfserving -i grpc -b 1 -p 5000 -# Collect data -$PERF_ANALYZER -m ${MODEL_NAME} --service-kind tfserving -i grpc -b ${STATIC_BATCH} -p 5000 -f ${NAME}.csv >> ${NAME}.log 2>&1 -if (( $? != 0 )); then - RET=1 -fi - -echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${NAME}.tjson -echo -e "\"s_benchmark_name\":\"resnet50\"," >> ${NAME}.tjson -echo -e "\"s_server\":\"tfserving\"," >> ${NAME}.tjson -echo -e "\"s_protocol\":\"grpc\"," >> ${NAME}.tjson -echo -e "\"s_framework\":\"savedmodel\"," >> ${NAME}.tjson -echo -e "\"s_model\":\"${MODEL_NAME}\"," >> ${NAME}.tjson -echo -e "\"l_concurrency\":1," >> ${NAME}.tjson -echo -e "\"l_batch_size\":1," >> ${NAME}.tjson -echo -e "\"l_instance_count\":1}]" >> ${NAME}.tjson - -if [ -f $REPORTER ]; then - set +e - - URL_FLAG= - if [ ! -z ${BENCHMARK_REPORTER_URL} ]; then - URL_FLAG="-u ${BENCHMARK_REPORTER_URL}" - fi - - $REPORTER -v -o ${NAME}.json --csv ${NAME}.csv ${URL_FLAG} ${NAME}.tjson - if (( $? != 0 )); then - RET=1 - fi - - set -e -fi - -# Large static batch size case. -STATIC_BATCH=128 -NAME=${MODEL_NAME}_sbatch${STATIC_BATCH} -$PERF_ANALYZER -m ${MODEL_NAME} --service-kind tfserving -i grpc -b ${STATIC_BATCH} -p 5000 -f ${NAME}.csv >> ${NAME}.log 2>&1 -if (( $? != 0 )); then - RET=1 -fi - -echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${NAME}.tjson -echo -e "\"s_benchmark_name\":\"resnet50\"," >> ${NAME}.tjson -echo -e "\"s_server\":\"tfserving\"," >> ${NAME}.tjson -echo -e "\"s_protocol\":\"grpc\"," >> ${NAME}.tjson -echo -e "\"s_framework\":\"savedmodel\"," >> ${NAME}.tjson -echo -e "\"s_model\":\"${MODEL_NAME}\"," >> ${NAME}.tjson -echo -e "\"l_concurrency\":1," >> ${NAME}.tjson -echo -e "\"l_batch_size\":128," >> ${NAME}.tjson -echo -e "\"l_instance_count\":1}]" >> ${NAME}.tjson - -if [ -f $REPORTER ]; then - set +e - - URL_FLAG= - if [ ! -z ${BENCHMARK_REPORTER_URL} ]; then - URL_FLAG="-u ${BENCHMARK_REPORTER_URL}" - fi - - $REPORTER -v -o ${NAME}.json --csv ${NAME}.csv ${URL_FLAG} ${NAME}.tjson - if (( $? 
!= 0 )); then - RET=1 - fi - - set -e -fi - -if (( $RET == 0 )); then - echo -e "\n***\n*** Test Passed\n***" -else - echo -e "\n***\n*** Test FAILED\n***" -fi - -exit $RET diff --git a/qa/L0_perf_ts/test.sh b/qa/L0_perf_ts/test.sh deleted file mode 100755 index f308a43c1e..0000000000 --- a/qa/L0_perf_ts/test.sh +++ /dev/null @@ -1,124 +0,0 @@ -#!/bin/bash -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY -# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY -# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - - -if [ "$#" -ge 1 ]; then - REPO_VERSION=$1 -fi -if [ -z "$REPO_VERSION" ]; then - echo -e "Repository version must be specified" - echo -e "\n***\n*** Test Failed\n***" - exit 1 -fi -if [ ! -z "$TEST_REPO_ARCH" ]; then - REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} -fi - -# TODO: DLIS-3777 following key update is required only while base image -# is not updated accordingly -apt-key del 7fa2af80 -apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/$(uname -m)/3bf863cc.pub - -apt update -apt install -y libb64-dev curl -apt install -y python3 python3-pip python3-dev -pip3 install --upgrade requests - -REPODIR=/data/inferenceserver/${REPO_VERSION} -PERF_ANALYZER=/perf_bin/perf_analyzer -REPORTER=../common/reporter.py - -rm -f *.log *.csv *.tjson *.json log4j.properties -rm -rf model_store -rm -rf serve - -RET=0 - -# Create model archive. 
Using default handler for image classification -MODEL_NAME="resnet50_fp32_libtorch" -mkdir model_store -torch-model-archiver --model-name resnet50 --version 1.0 --serialized-file ${REPODIR}/perf_model_store/${MODEL_NAME}/1/model.pt \ ---export-path model_store --handler image_classifier -f -# Suppressing the logging for better performance -echo "log4j.rootLogger = OFF" >> log4j.properties -# Run server -torchserve --start --ncs --model-store=model_store --models model_store/resnet50.mar --log-config log4j.properties - -sleep 5 - -# Get the input image to be used for generating requests -STATIC_BATCH=1 -curl -O https://raw.githubusercontent.com/pytorch/serve/master/docs/images/kitten_small.jpg -echo "{\"data\":[{\"TORCHSERVE_INPUT\" : [\"kitten_small.jpg\"]}]}" >> data.json -NAME=${MODEL_NAME}_sbatch${STATIC_BATCH} -PERF_ANALYZER_ARGS="-m resnet50 --service-kind torchserve -i http -u localhost:8080 -b ${STATIC_BATCH} -p 5000 --input-data data.json" - -# Run client -# To warmup the model -$PERF_ANALYZER ${PERF_ANALYZER_ARGS} -# Collect data -$PERF_ANALYZER ${PERF_ANALYZER_ARGS} -f ${NAME}.csv >> ${NAME}.log 2>&1 -if (( $? != 0 )); then - RET=1 -fi - -torchserve --stop - -echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >> ${NAME}.tjson -echo -e "\"s_benchmark_name\":\"preprocess+resnet50\"," >> ${NAME}.tjson -echo -e "\"s_server\":\"torchserve\"," >> ${NAME}.tjson -echo -e "\"s_protocol\":\"http\"," >> ${NAME}.tjson -echo -e "\"s_framework\":\"libtorch\"," >> ${NAME}.tjson -echo -e "\"s_model\":\"${MODEL_NAME}\"," >> ${NAME}.tjson -echo -e "\"l_concurrency\":1," >> ${NAME}.tjson -echo -e "\"l_batch_size\":1," >> ${NAME}.tjson -echo -e "\"l_instance_count\":1}]" >> ${NAME}.tjson - - -if [ -f $REPORTER ]; then - set +e - - URL_FLAG= - if [ ! -z ${BENCHMARK_REPORTER_URL} ]; then - URL_FLAG="-u ${BENCHMARK_REPORTER_URL}" - fi - - python $REPORTER -v -o ${NAME}.json --csv ${NAME}.csv ${URL_FLAG} ${NAME}.tjson - if (( $? != 0 )); then - RET=1 - fi - - set -e -fi - -if (( $RET == 0 )); then - echo -e "\n***\n*** Test Passed\n***" -else - echo -e "\n***\n*** Test FAILED\n***" -fi - -exit $RET diff --git a/qa/L0_perf_vllm/test.sh b/qa/L0_perf_vllm/test.sh new file mode 100755 index 0000000000..498f6f8e14 --- /dev/null +++ b/qa/L0_perf_vllm/test.sh @@ -0,0 +1,146 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
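The set -o pipefail guards added around the perf client runs in the L0_perf_nomodel and L0_perf_resnet scripts above matter because the measurement command is piped through tee; without pipefail, $? reports tee's exit status and a failed measurement would pass silently. A minimal illustration of the pattern, with a hypothetical measure_cmd standing in for the perf client invocation:

    set +e
    set -o pipefail
    measure_cmd 2>&1 | tee measurement.log   # hypothetical command piped to tee
    if [ $? -ne 0 ]; then                    # with pipefail, $? reflects measure_cmd, not tee
        echo -e "\n***\n*** FAILED Perf Analyzer measurement\n***"
        RET=1
    fi
    set +o pipefail
    set -e

The PA_MAX_TRIALS / --max-trials addition in the same hunks bounds how many measurement windows Perf Analyzer attempts before giving up on stability, so an unstable model fails fast instead of stalling the run.
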
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +source ../common/util.sh + +REPORTER=../common/reporter.py +TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} +SERVER=${TRITON_DIR}/bin/tritonserver +BACKEND_DIR=${TRITON_DIR}/backends +MODEL_REPO="${PWD}/models" +NAME="vllm_benchmarking_test" +MODEL_NAME="gpt2_vllm" +INPUT_DATA="./input_data.json" +SERVER_LOG="${NAME}_server.log" +SERVER_ARGS="--model-repository=${MODEL_REPO} --backend-directory=${BACKEND_DIR} --log-verbose=1" + +export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:=0} +EXPORT_FILE=profile-export-vllm-model.json + +pip3 install tritonclient nvidia-ml-py3 +rm -rf $MODEL_REPO $EXPORT_FILE *.tjson *.json *.csv + +mkdir -p $MODEL_REPO/$MODEL_NAME/1 +echo '{ + "model":"gpt2", + "disable_log_requests": "true", + "gpu_memory_utilization": 0.5 +}' >$MODEL_REPO/$MODEL_NAME/1/model.json + +echo 'backend: "vllm" +instance_group [ + { + count: 1 + kind: KIND_MODEL + } +]' >$MODEL_REPO/$MODEL_NAME/config.pbtxt + +echo '{ + "data": [ + { + "text_input": [ + "hi hi hi hi hi hi hi hi hi hi" + ], + "stream": [ + true + ], + "sampling_parameters": [ + "{\"max_tokens\": 1024, \"ignore_eos\": true}" + ] + } + ] +}' >$INPUT_DATA + +RET=0 +ARCH="amd64" +STATIC_BATCH=1 +INSTANCE_CNT=1 +CONCURRENCY=100 +MODEL_FRAMEWORK="vllm" +PERF_CLIENT_PROTOCOL="grpc" +PERF_CLIENT=perf_analyzer + +# Set stability-percentage 999 to bypass the stability check in PA. +# LLM generates a sequence of tokens that is unlikely to be within a reasonable bound to determine valid measurement in terms of latency. +# Using "count_windows" measurement mode, which automatically extends the window for collecting responses. +PERF_CLIENT_ARGS="-v -m $MODEL_NAME --concurrency-range=${CONCURRENCY} --measurement-mode=count_windows --measurement-request-count=10 \ + --input-data=$INPUT_DATA --profile-export-file=$EXPORT_FILE -i $PERF_CLIENT_PROTOCOL --async --streaming --stability-percentage=999" + +run_server +if (($SERVER_PID == 0)); then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +$PERF_CLIENT $PERF_CLIENT_ARGS -f ${NAME}.csv 2>&1 | tee ${NAME}_perf_analyzer.log +set +o pipefail +set -e + +if [[ -n "${SERVER_PID}" ]]; then + kill $SERVER_PID + wait $SERVER_PID +fi + +echo -e "[{\"s_benchmark_kind\":\"benchmark_perf\"," >>${NAME}.tjson +echo -e "\"s_benchmark_repo_branch\":\"${BENCHMARK_REPO_BRANCH}\"," >>${NAME}.tjson +echo -e "\"s_benchmark_name\":\"${NAME}\"," >>${NAME}.tjson +echo -e "\"s_server\":\"triton\"," >>${NAME}.tjson +echo -e "\"s_protocol\":\"${PERF_CLIENT_PROTOCOL}\"," >>${NAME}.tjson +echo -e "\"s_framework\":\"${MODEL_FRAMEWORK}\"," >>${NAME}.tjson +echo -e "\"s_model\":\"${MODEL_NAME}\"," >>${NAME}.tjson +echo -e "\"l_concurrency\":\"${CONCURRENCY}\"," >>${NAME}.tjson +echo -e "\"l_batch_size\":${STATIC_BATCH}," >>${NAME}.tjson +echo -e "\"l_instance_count\":${INSTANCE_CNT}," >>${NAME}.tjson +echo -e "\"s_architecture\":\"${ARCH}\"}]" >>${NAME}.tjson + +if [ -f $REPORTER ]; then + set +e + + URL_FLAG= + if [ ! 
-z ${BENCHMARK_REPORTER_URL} ]; then + URL_FLAG="-u ${BENCHMARK_REPORTER_URL}" + fi + + python3 $REPORTER -v -e ${EXPORT_FILE} -o ${NAME}.json --csv ${NAME}.csv --gpu-metrics --token-latency ${URL_FLAG} ${NAME}.tjson + if (($? != 0)); then + RET=1 + fi + + set -e +fi + +rm -rf $MODEL_REPO $INPUT_DATA + +if (($RET == 0)); then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_pinned_memory/test.sh b/qa/L0_pinned_memory/test.sh index 799c908b76..89b59d7c18 100755 --- a/qa/L0_pinned_memory/test.sh +++ b/qa/L0_pinned_memory/test.sh @@ -50,7 +50,7 @@ source ../common/util.sh # Select the single GPU that will be available to the inference server export CUDA_VISIBLE_DEVICES=0 -rm -f *.log *.serverlog *.csv *.metrics +rm -f *.log *.csv *.metrics RET=0 rm -fr ./custom_models && mkdir ./custom_models && \ @@ -81,7 +81,7 @@ for BACKEND in $BACKENDS; do # With pinned memory SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1" - SERVER_LOG="${ENSEMBLE_NAME}.pinned.serverlog" + SERVER_LOG="${ENSEMBLE_NAME}.pinned.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -96,7 +96,7 @@ for BACKEND in $BACKENDS; do RET=1 fi - grep "] non-pinned" ${ENSEMBLE_NAME}.pinned.serverlog + grep "] non-pinned" ${ENSEMBLE_NAME}.pinned.server.log if [ $? -eq 0 ]; then echo -e "\n***\n*** Failed. Expected only pinned memory is allocated\n***" RET=1 @@ -108,7 +108,7 @@ for BACKEND in $BACKENDS; do # Restart the server without verbose logging SERVER_ARGS="--model-repository=`pwd`/models" - SERVER_LOG="${ENSEMBLE_NAME}.pinned.serverlog" + SERVER_LOG="${ENSEMBLE_NAME}.pinned.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -133,7 +133,7 @@ for BACKEND in $BACKENDS; do # Without pinned memory SERVER_ARGS="--model-repository=`pwd`/models --pinned-memory-pool-byte-size=0 --log-verbose=1" - SERVER_LOG="${ENSEMBLE_NAME}.nonpinned.serverlog" + SERVER_LOG="${ENSEMBLE_NAME}.nonpinned.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -148,7 +148,7 @@ for BACKEND in $BACKENDS; do RET=1 fi - grep "] pinned" ${ENSEMBLE_NAME}.nonpinned.serverlog + grep "] pinned" ${ENSEMBLE_NAME}.nonpinned.server.log if [ $? -eq 0 ]; then echo -e "\n***\n*** Failed. Expected only non-pinned memory is allocated\n***" RET=1 @@ -160,7 +160,7 @@ for BACKEND in $BACKENDS; do # Restart the server without verbose logging SERVER_ARGS="--model-repository=`pwd`/models --pinned-memory-pool-byte-size=0" - SERVER_LOG="${ENSEMBLE_NAME}.nonpinned.serverlog" + SERVER_LOG="${ENSEMBLE_NAME}.nonpinned.server.log" run_server if (( $SERVER_PID == 0 )); then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_python_api/test.sh b/qa/L0_python_api/test.sh new file mode 100755 index 0000000000..c5021acae0 --- /dev/null +++ b/qa/L0_python_api/test.sh @@ -0,0 +1,50 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +TEST_LOG="./python_binding.log" + +RET=0 + +rm -f $TEST_LOG + +set +e + +python test_binding.py > $TEST_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test Failed\n***" + RET=1 +fi +set -e + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $TEST_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_jetson_example/test.sh b/qa/L0_python_client_unit_tests/test.sh old mode 100644 new mode 100755 similarity index 57% rename from qa/L0_jetson_example/test.sh rename to qa/L0_python_client_unit_tests/test.sh index 4d692a8b0a..5a46ecccc5 --- a/qa/L0_jetson_example/test.sh +++ b/qa/L0_python_client_unit_tests/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,39 +25,30 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
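For readability, the L0_perf_vllm measurement above, with PERF_CLIENT_ARGS expanded, comes down to roughly this single invocation (model name, input file, and other values taken from the variables set in that test):

    perf_analyzer -v -m gpt2_vllm \
        --concurrency-range=100 \
        --measurement-mode=count_windows --measurement-request-count=10 \
        --input-data=./input_data.json \
        --profile-export-file=profile-export-vllm-model.json \
        -i grpc --async --streaming \
        --stability-percentage=999 \
        -f vllm_benchmarking_test.csv 2>&1 | tee vllm_benchmarking_test_perf_analyzer.log
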
-wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tao/peoplenet/versions/pruned_v2.1/zip -O pruned_v2.1.zip -unzip pruned_v2.1.zip -d concurrency_and_dynamic_batching/tao/models/peoplenet && rm pruned_v2.1.zip +TEST_LOG="./python_client_unit_tests.log" +PYTHON_CLIENT_UNIT_TESTS_DIR=/opt/tritonserver/qa/python_client_unit_tests/ +PYTHON_CLIENT_UNIT_TESTS_CMD="python3 -m unittest discover -v -s $PYTHON_CLIENT_UNIT_TESTS_DIR -t $PYTHON_CLIENT_UNIT_TESTS_DIR" -# Use TAO convertor for JP4.6 -wget --content-disposition https://developer.nvidia.com/jp46-20210820t231431z-001zip -O jp4.6-20210820T231431Z-001.zip -unzip jp4.6-20210820T231431Z-001.zip && rm jp4.6-20210820T231431Z-001.zip +# DLPack test requires Torch to validate GPU tensor +pip3 install torch -cp tao-converter-jp46-trt8.0.1.6/tao-converter concurrency_and_dynamic_batching/tao/tao-converter && rm -rf jp4.6 -chmod 777 concurrency_and_dynamic_batching/tao/tao-converter +RET=0 -(cd concurrency_and_dynamic_batching/tao && bash convert_peoplenet.sh) +rm -f $TEST_LOG -# Build the example and make sure permissions -cd concurrency_and_dynamic_batching && make +set +e -CLIENT_LOG="./client.log" - -# Running the example/s -./people_detection -m gpu -v -r trtis_model_repo_sample_1 -t 6 -s false -p ${HOME}/tritonserver >> ${CLIENT_LOG}.1 2>&1 -if [ $? -ne 0 ]; then - cat $CLIENT_LOG.1 - RET=1 -fi - -./people_detection -m gpu -v -r trtis_model_repo_sample_2 -t 6 -s false -p ${HOME}/tritonserver >> ${CLIENT_LOG}.2 2>&1 +$PYTHON_CLIENT_UNIT_TESTS_CMD > $TEST_LOG 2>&1 if [ $? -ne 0 ]; then - cat $CLIENT_LOG.2 + echo -e "\n***\n*** Test Failed\n***" RET=1 fi +set -e if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else + cat $TEST_LOG echo -e "\n***\n*** Test FAILED\n***" fi diff --git a/qa/L0_query/query_e2e.py b/qa/L0_query/query_e2e.py old mode 100644 new mode 100755 index 69849749d4..048a4a8d41 --- a/qa/L0_query/query_e2e.py +++ b/qa/L0_query/query_e2e.py @@ -1,5 +1,5 @@ #!/usr/bin/env python -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
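Several test scripts in this change flip from mode 100644 to 100755 and gain a shebang so they can be invoked directly rather than only through an interpreter. The equivalent local edit, shown here for one of the affected files, is simply:

    chmod +x qa/L0_query/query_e2e.py      # recorded by git as mode 100644 -> 100755
    head -n1 qa/L0_query/query_e2e.py      # should print the interpreter line, e.g. #!/usr/bin/env python
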
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -27,26 +27,23 @@ import sys -sys.path.append('../common') +sys.path.append("../common") + +import unittest -import argparse import numpy as np -import os -from builtins import range -import tritonclient.http as tritonhttpclient +import test_util as tu import tritonclient.grpc as tritongrpcclient +import tritonclient.http as tritonhttpclient from tritonclient.utils import InferenceServerException from tritonclient.utils import cuda_shared_memory as cudashm -import unittest -import test_util as tu class QueryTest(tu.TestResultCollector): - def test_http(self): triton_client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritonhttpclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) try: @@ -59,33 +56,33 @@ def test_http(self): def test_http_shared_memory(self): triton_client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritonhttpclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) # Set up CUDA shared memory for outputs triton_client.unregister_system_shared_memory() triton_client.unregister_cuda_shared_memory() - shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", 4, 0) - shm_op1_handle = cudashm.create_shared_memory_region( - "output1_data", 4, 0) + shm_op0_handle = cudashm.create_shared_memory_region("output0_data", 4, 0) + shm_op1_handle = cudashm.create_shared_memory_region("output1_data", 4, 0) triton_client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 4) + "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 4 + ) triton_client.register_cuda_shared_memory( - "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 4) + "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 4 + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs[-1].set_shared_memory("output0_data", 4) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) outputs[-1].set_shared_memory("output1_data", 4) try: - triton_client.infer(model_name="query", - inputs=inputs, - outputs=outputs) + triton_client.infer(model_name="query", inputs=inputs, outputs=outputs) self.assertTrue(False, "expect error with query information") except InferenceServerException as ex: self.assertTrue("OUTPUT0 GPU 0" in ex.message()) @@ -99,34 +96,34 @@ def test_http_shared_memory(self): def test_http_out_of_shared_memory(self): triton_client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritonhttpclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) # Set up too small CUDA shared memory for outputs, expect query # returns default value triton_client.unregister_system_shared_memory() triton_client.unregister_cuda_shared_memory() - shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", 1, 0) - shm_op1_handle = 
cudashm.create_shared_memory_region( - "output1_data", 1, 0) + shm_op0_handle = cudashm.create_shared_memory_region("output0_data", 1, 0) + shm_op1_handle = cudashm.create_shared_memory_region("output1_data", 1, 0) triton_client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 1) + "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 1 + ) triton_client.register_cuda_shared_memory( - "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 1) + "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 1 + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs[-1].set_shared_memory("output0_data", 1) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) outputs[-1].set_shared_memory("output1_data", 1) try: - triton_client.infer(model_name="query", - inputs=inputs, - outputs=outputs) + triton_client.infer(model_name="query", inputs=inputs, outputs=outputs) self.assertTrue(False, "expect error with query information") except InferenceServerException as ex: self.assertTrue("OUTPUT0 CPU 0" in ex.message()) @@ -140,7 +137,7 @@ def test_http_out_of_shared_memory(self): def test_grpc(self): triton_client = tritongrpcclient.InferenceServerClient("localhost:8001") inputs = [] - inputs.append(tritongrpcclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritongrpcclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) try: @@ -153,31 +150,29 @@ def test_grpc(self): def test_grpc_shared_memory(self): triton_client = tritongrpcclient.InferenceServerClient("localhost:8001") inputs = [] - inputs.append(tritongrpcclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritongrpcclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) # Set up CUDA shared memory for outputs triton_client.unregister_system_shared_memory() triton_client.unregister_cuda_shared_memory() - shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", 4, 0) - shm_op1_handle = cudashm.create_shared_memory_region( - "output1_data", 4, 0) + shm_op0_handle = cudashm.create_shared_memory_region("output0_data", 4, 0) + shm_op1_handle = cudashm.create_shared_memory_region("output1_data", 4, 0) triton_client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 4) + "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 4 + ) triton_client.register_cuda_shared_memory( - "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 4) + "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 4 + ) outputs = [] - outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0')) + outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0")) outputs[-1].set_shared_memory("output0_data", 4) - outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1')) + outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1")) outputs[-1].set_shared_memory("output1_data", 4) try: - triton_client.infer(model_name="query", - inputs=inputs, - outputs=outputs) + triton_client.infer(model_name="query", inputs=inputs, outputs=outputs) self.assertTrue(False, "expect error with query information") except InferenceServerException as ex: self.assertTrue("OUTPUT0 GPU 0" in ex.message()) @@ -191,32 +186,30 @@ def 
test_grpc_shared_memory(self): def test_grpc_out_of_shared_memory(self): triton_client = tritongrpcclient.InferenceServerClient("localhost:8001") inputs = [] - inputs.append(tritongrpcclient.InferInput('INPUT', [1], "UINT8")) + inputs.append(tritongrpcclient.InferInput("INPUT", [1], "UINT8")) inputs[0].set_data_from_numpy(np.arange(1, dtype=np.uint8)) # Set up too small CUDA shared memory for outputs, expect query # returns default value triton_client.unregister_system_shared_memory() triton_client.unregister_cuda_shared_memory() - shm_op0_handle = cudashm.create_shared_memory_region( - "output0_data", 1, 0) - shm_op1_handle = cudashm.create_shared_memory_region( - "output1_data", 1, 0) + shm_op0_handle = cudashm.create_shared_memory_region("output0_data", 1, 0) + shm_op1_handle = cudashm.create_shared_memory_region("output1_data", 1, 0) triton_client.register_cuda_shared_memory( - "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 1) + "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0, 1 + ) triton_client.register_cuda_shared_memory( - "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 1) + "output1_data", cudashm.get_raw_handle(shm_op1_handle), 0, 1 + ) outputs = [] - outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT0')) + outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT0")) outputs[-1].set_shared_memory("output0_data", 1) - outputs.append(tritongrpcclient.InferRequestedOutput('OUTPUT1')) + outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT1")) outputs[-1].set_shared_memory("output1_data", 1) try: - triton_client.infer(model_name="query", - inputs=inputs, - outputs=outputs) + triton_client.infer(model_name="query", inputs=inputs, outputs=outputs) self.assertTrue(False, "expect error with query information") except InferenceServerException as ex: self.assertTrue("OUTPUT0 CPU 0" in ex.message()) @@ -228,5 +221,5 @@ def test_grpc_out_of_shared_memory(self): triton_client.unregister_cuda_shared_memory() -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_query/test.sh b/qa/L0_query/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_rate_limiter/rate_limiter_test.py b/qa/L0_rate_limiter/rate_limiter_test.py old mode 100644 new mode 100755 index c02c50b61e..4bc7b82e70 --- a/qa/L0_rate_limiter/rate_limiter_test.py +++ b/qa/L0_rate_limiter/rate_limiter_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,11 +31,12 @@ sys.path.append("../common") import functools -import numpy as np import os -import unittest import threading import time +import unittest + +import numpy as np import sequence_util as su import tritongrpcclient as grpcclient from tritonclientutils import * @@ -46,7 +49,6 @@ class AsyncGrpcRunner: - def __init__(self, tester, server_url, model_name, delay_ms): self._tester = tester self._server_url = server_url @@ -79,18 +81,17 @@ def req_loop(self): client = grpcclient.InferenceServerClient(self._server_url) inputs = [ - grpcclient.InferInput("INPUT0", self._shape, - np_to_triton_dtype(self._dtype)) + grpcclient.InferInput( + "INPUT0", self._shape, np_to_triton_dtype(self._dtype) + ) ] self._inflight_requests = 0 - start_stat = client.get_inference_statistics( - model_name=self._model_name) + start_stat = client.get_inference_statistics(model_name=self._model_name) global _exit_signal while not _exit_signal: - input_numpy = np.random.random_sample(self._shape).astype( - self._dtype) + input_numpy = np.random.random_sample(self._shape).astype(self._dtype) inputs[0].set_data_from_numpy(input_numpy) self._input_data.append(input_numpy) @@ -99,12 +100,15 @@ def req_loop(self): def _check_can_send(): return self._inflight_requests < _inference_concurrency - can_send = self._sync.wait_for(_check_can_send, - timeout=_response_wait_time_s) + can_send = self._sync.wait_for( + _check_can_send, timeout=_response_wait_time_s + ) self._tester.assertTrue( can_send, "client didn't receive a response within {}s".format( - _response_wait_time_s)) + _response_wait_time_s + ), + ) callback = functools.partial(AsyncGrpcRunner._on_result, self) client.async_infer( @@ -115,7 +119,7 @@ def _check_can_send(): ) self._inflight_requests += 1 self._num_sent_request += 1 - if (self._num_sent_request == _inference_count): + if self._num_sent_request == _inference_count: _exit_signal = True time.sleep(self._delay_ms / 1000.0) @@ -125,17 +129,21 @@ def _check_can_send(): def _all_processed(): return self._inflight_requests == 0 - self._processed_all = self._sync.wait_for(_all_processed, - _finish_wait_time_s) + self._processed_all = self._sync.wait_for( + _all_processed, _finish_wait_time_s + ) self._tester.assertTrue( self._processed_all, - "the processing didn't complete even after waiting for {}s". 
- format(_finish_wait_time_s)) + "the processing didn't complete even after waiting for {}s".format( + _finish_wait_time_s + ), + ) end_stat = client.get_inference_statistics(model_name=self._model_name) - self._processed_request_count = end_stat.model_stats[ - 0].inference_stats.success.count - start_stat.model_stats[ - 0].inference_stats.success.count + self._processed_request_count = ( + end_stat.model_stats[0].inference_stats.success.count + - start_stat.model_stats[0].inference_stats.success.count + ) def start(self): self._req_thread.start() @@ -144,13 +152,15 @@ def _validate_run(self): if len(self._errors) != 0: raise self._errors[0] self._tester.assertEqual( - len(self._input_data), len(self._results.keys()), - "the number of inputs and output should match") + len(self._input_data), + len(self._results.keys()), + "the number of inputs and output should match", + ) for i in range(len(self._input_data)): self._tester.assertFalse( - (self._input_data[i] != - self._results[i].as_numpy('OUTPUT0')).any(), - "the output data should match with the input data") + (self._input_data[i] != self._results[i].as_numpy("OUTPUT0")).any(), + "the output data should match with the input data", + ) def join(self): self._req_thread.join() @@ -158,17 +168,16 @@ def join(self): class RateLimiterTest(su.SequenceBatcherTestUtil): - def stress_models(self, model_names, delay_ms=0): infer_counts = {} try: runners = [] for model_name in model_names: runners.append( - AsyncGrpcRunner(self, - "localhost:8001", - model_name, - delay_ms=delay_ms)) + AsyncGrpcRunner( + self, "localhost:8001", model_name, delay_ms=delay_ms + ) + ) for r in runners: r.start() for r in runners: @@ -191,7 +200,7 @@ def test_single_model(self): def test_cross_model_prioritization_limited_resource(self): # Sends requests to two models, one operating at # priority of 1 and other at 2 respectively. - # The availabe resource counts doesn't allow models + # The available resource counts doesn't allow models # to execute simultaneously. model_names = ["custom_zero_1_float32", "custom_zero_1_float32_v2"] @@ -199,32 +208,36 @@ def test_cross_model_prioritization_limited_resource(self): # TODO: Validate the priority and resource counts are set correctly infer_counts = self.stress_models(model_names) - infer_ratio = infer_counts[model_names[0]] / float( - infer_counts[model_names[1]]) + infer_ratio = infer_counts[model_names[0]] / float(infer_counts[model_names[1]]) self.assertGreater( - infer_ratio, 1.80, + infer_ratio, + 1.80, "Got infer ratio across models {}, expected closer to 2".format( - infer_ratio)) + infer_ratio + ), + ) def test_cross_model_prioritization_plenty_resource(self): # Sends requests to two models, one operating at # priority of 1 and other at 2 respectively. - # The availabe resource counts wll allow both models - # to run simulataneously. + # The available resource counts wll allow both models + # to run simultaneously. model_names = ["custom_zero_1_float32", "custom_zero_1_float32_v2"] # TODO: Validate the priority and resource counts are set correctly infer_counts = self.stress_models(model_names) - infer_diff = abs(infer_counts[model_names[0]] - - infer_counts[model_names[1]]) + infer_diff = abs(infer_counts[model_names[0]] - infer_counts[model_names[1]]) self.assertGreater( - 10, infer_diff, - "Got infer difference between models {}, expected closer to 0". 
- format(infer_diff)) + 10, + infer_diff, + "Got infer difference between models {}, expected closer to 0".format( + infer_diff + ), + ) def test_single_model_dynamic_batching(self): # Send all the inference requests with a delay to a model @@ -242,18 +255,25 @@ def test_single_model_dynamic_batching(self): batch_stats = stats.model_stats[0].batch_stats self.assertEqual( - len(batch_stats), 1, - "expected single batch-size, got {}".format(len(batch_stats))) + len(batch_stats), + 1, + "expected single batch-size, got {}".format(len(batch_stats)), + ) for batch_stat in batch_stats: self.assertEqual( - batch_stat.batch_size, 4, - "unexpected batch-size {}".format(batch_stat.batch_size)) + batch_stat.batch_size, + 4, + "unexpected batch-size {}".format(batch_stat.batch_size), + ) # Get count from one of the stats self.assertEqual( - batch_stat.compute_infer.count, _inference_count / 4, - "expected model-execution-count {} for batch size {}, got {}". - format(_inference_count / 4, 4, batch_stat.compute_infer.count)) + batch_stat.compute_infer.count, + _inference_count / 4, + "expected model-execution-count {} for batch size {}, got {}".format( + _inference_count / 4, 4, batch_stat.compute_infer.count + ), + ) def test_single_model_sequence_batching(self): # Send one sequence and check for correct accumulator @@ -265,19 +285,26 @@ def test_single_model_sequence_batching(self): model_name = "custom_sequence_int32" self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.check_sequence( - 'custom', + "custom", model_name, np.int32, 5, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - (("start", 1, None, None), (None, 2, None, None), - (None, 3, None, None), (None, 4, None, None), - (None, 5, None, None), (None, 6, None, None), - (None, 7, None, None), (None, 8, None, None), - ("end", 9, None, None)), + ( + ("start", 1, None, None), + (None, 2, None, None), + (None, 3, None, None), + (None, 4, None, None), + (None, 5, None, None), + (None, 6, None, None), + (None, 7, None, None), + (None, 8, None, None), + ("end", 9, None, None), + ), 45, - 'grpc') + "grpc", + ) self.check_deferred_exception() self.check_status(model_name, {1: 9}, 9, 9) @@ -285,5 +312,5 @@ def test_single_model_sequence_batching(self): self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_rate_limiter/test.sh b/qa/L0_rate_limiter/test.sh old mode 100644 new mode 100755 index 9a23822056..334af99e4c --- a/qa/L0_rate_limiter/test.sh +++ b/qa/L0_rate_limiter/test.sh @@ -102,12 +102,17 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID fi + +set +e grep "Resource count for \"resource1\" is limited to 1 which will prevent scheduling of one or more model instances, the minimum required count is 4" $SERVER_LOG if [ $? -ne 0 ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed. Expected error message while loading the model \"custom_zero_1_float32\"\n***" RET=1 fi +set -e + # Case2: resources sufficient only for one model SERVER_ARGS="--rate-limit=execution_count --rate-limit-resource=resource1:3 --rate-limit-resource=resource2:2 --model-repository=$MODELDIR/custom_models" SERVER_LOG="./inference_server_r3.log" @@ -119,12 +124,17 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID fi + +set +e grep "Resource count for \"resource1\" is limited to 3 which will prevent scheduling of one or more model instances, the minimum required count is 4" $SERVER_LOG if [ $? 
-ne 0 ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed. Expected error message while loading the model \"custom_zero_1_float32\"\n***" RET=1 fi +set -e + # Case3: Resource specified only for specific device id 10 and not for the GPU that loads the model instance. SERVER_ARGS="--rate-limit=execution_count --rate-limit-resource=resource1:10:10 --rate-limit-resource=resource2:2 --model-repository=$MODELDIR/custom_models" SERVER_LOG="./inference_server_rdevice.log" @@ -136,12 +146,17 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID fi + +set +e grep "Resource count for \"resource1\" is limited to 0 which will prevent scheduling of one or more model instances, the minimum required count is 4" $SERVER_LOG if [ $? -ne 0 ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed. Expected error message while loading the model \"custom_zero_1_float32\"\n***" RET=1 fi +set -e + # Case4: Conflicting resource types in the config cp -r ./custom_models/custom_zero_1_float32_v2 ./custom_models/custom_zero_1_float32_v3 (cd custom_models/custom_zero_1_float32_v3 && \ @@ -158,13 +173,18 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID fi + +set +e grep "Resource \"resource2\" is present as both global and device-specific resource in the model configuration." $SERVER_LOG if [ $? -ne 0 ]; then + cat $SERVER_LOG echo -e "\n***\n*** Failed. Expected error message for conflicting resource types\n***" RET=1 fi rm -rf ./custom_models/custom_zero_1_float32_v3 +set -e + ## ## Tests with cross-model prioritization with various cases: ## @@ -258,7 +278,7 @@ kill $SERVER_PID wait $SERVER_PID ## -## Tests with mulitple instances of the same model +## Tests with multiple instances of the same model ## # Replace the second model with a second instance with same resource requirements and priority. # TODO: Currently there is no way to check which instance got to run inferences hence we only diff --git a/qa/L0_register/test.sh b/qa/L0_register/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_repoagent_checksum/identity_test.py b/qa/L0_repoagent_checksum/identity_test.py old mode 100644 new mode 100755 index ad9f268967..4db55e0d45 --- a/qa/L0_repoagent_checksum/identity_test.py +++ b/qa/L0_repoagent_checksum/identity_test.py @@ -27,40 +27,43 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import argparse -import numpy as np import sys + +import numpy as np import tritongrpcclient as grpcclient import tritonhttpclient as httpclient from tritonclientutils import np_to_triton_dtype FLAGS = None -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-u', - '--url', - type=str, - required=False, - help='Inference server URL.') parser.add_argument( - '-i', - '--protocol', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", "--url", type=str, required=False, help="Inference server URL." + ) + parser.add_argument( + "-i", + "--protocol", type=str, required=False, - default='http', - help='Protocol ("http"/"grpc") used to ' + - 'communicate with inference service. Default is "http".') + default="http", + help='Protocol ("http"/"grpc") used to ' + + 'communicate with inference service. 
Default is "http".', + ) FLAGS = parser.parse_args() if (FLAGS.protocol != "http") and (FLAGS.protocol != "grpc"): - print("unexpected protocol \"{}\", expects \"http\" or \"grpc\"".format( - FLAGS.protocol)) + print( + 'unexpected protocol "{}", expects "http" or "grpc"'.format(FLAGS.protocol) + ) exit(1) client_util = httpclient if FLAGS.protocol == "http" else grpcclient @@ -69,23 +72,23 @@ FLAGS.url = "localhost:8000" if FLAGS.protocol == "http" else "localhost:8001" # Reuse a single client for all sync tests - with client_util.InferenceServerClient(FLAGS.url, - verbose=FLAGS.verbose) as client: + with client_util.InferenceServerClient(FLAGS.url, verbose=FLAGS.verbose) as client: for model_name, np_dtype, shape in ( - # yapf: disable + # yapf: disable ("identity_int32", np.int32, [0]), - ("identity_int32", np.int32, [7])): + ("identity_int32", np.int32, [7]) + ): # yapf: enable if np_dtype != object: input_data = (16384 * np.random.randn(*shape)).astype(np_dtype) else: - in0 = (16384 * np.ones(shape, dtype='int')) - in0n = np.array([str(x) for x in in0.reshape(in0.size)], - dtype=object) + in0 = 16384 * np.ones(shape, dtype="int") + in0n = np.array([str(x) for x in in0.reshape(in0.size)], dtype=object) input_data = in0n.reshape(in0.shape) inputs = [ - client_util.InferInput("INPUT0", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + client_util.InferInput( + "INPUT0", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) @@ -102,6 +105,9 @@ output_data = np.char.decode(output_data) if not np.array_equal(output_data, input_data): - print("error: expected output {} to match input {}".format( - output_data, input_data)) + print( + "error: expected output {} to match input {}".format( + output_data, input_data + ) + ) sys.exit(1) diff --git a/qa/L0_request_cancellation/grpc_cancellation_test.py b/qa/L0_request_cancellation/grpc_cancellation_test.py new file mode 100755 index 0000000000..fadaa291e8 --- /dev/null +++ b/qa/L0_request_cancellation/grpc_cancellation_test.py @@ -0,0 +1,141 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import asyncio +import queue +import time +import unittest +from functools import partial + +import numpy as np +import tritonclient.grpc as grpcclient +import tritonclient.grpc.aio as grpcclientaio +from tritonclient.utils import InferenceServerException + + +class UserData: + def __init__(self): + self._completed_requests = queue.Queue() + + +def callback(user_data, result, error): + if error: + user_data._completed_requests.put(error) + else: + user_data._completed_requests.put(result) + + +class GrpcCancellationTest(unittest.IsolatedAsyncioTestCase): + _model_name = "custom_identity_int32" + _model_delay = 10.0 # seconds + _grpc_params = {"url": "localhost:8001", "verbose": True} + + def setUp(self): + self._client = grpcclient.InferenceServerClient(**self._grpc_params) + self._client_aio = grpcclientaio.InferenceServerClient(**self._grpc_params) + self._user_data = UserData() + self._callback = partial(callback, self._user_data) + self._prepare_request() + self._start_time = time.time() # seconds + + def tearDown(self): + self._end_time = time.time() # seconds + self._assert_max_duration() + + def _prepare_request(self): + self._inputs = [] + self._inputs.append(grpcclient.InferInput("INPUT0", [1, 1], "INT32")) + self._outputs = [] + self._outputs.append(grpcclient.InferRequestedOutput("OUTPUT0")) + self._inputs[0].set_data_from_numpy(np.array([[10]], dtype=np.int32)) + + def _assert_max_duration(self): + max_duration = self._model_delay * 0.5 # seconds + duration = self._end_time - self._start_time # seconds + self.assertLess( + duration, + max_duration, + f"test runtime expected less than {max_duration}s response time, got {duration}s", + ) + + def _assert_callback_cancelled(self): + self.assertFalse(self._user_data._completed_requests.empty()) + data_item = self._user_data._completed_requests.get() + self.assertIsInstance(data_item, InferenceServerException) + self.assertIn("Locally cancelled by application!", str(data_item)) + + def test_grpc_async_infer(self): + future = self._client.async_infer( + model_name=self._model_name, + inputs=self._inputs, + callback=self._callback, + outputs=self._outputs, + ) + time.sleep(2) # ensure the inference has started + future.cancel() + time.sleep(0.1) # context switch + self._assert_callback_cancelled() + + def test_grpc_stream_infer(self): + self._client.start_stream(callback=self._callback) + self._client.async_stream_infer( + model_name=self._model_name, inputs=self._inputs, outputs=self._outputs + ) + time.sleep(2) # ensure the inference has started + self._client.stop_stream(cancel_requests=True) + self._assert_callback_cancelled() + + async def test_aio_grpc_async_infer(self): + infer_task = asyncio.create_task( + self._client_aio.infer( + model_name=self._model_name, inputs=self._inputs, outputs=self._outputs + ) + ) + await asyncio.sleep(2) # ensure the inference has started + infer_task.cancel() + with self.assertRaises(asyncio.CancelledError): + await infer_task + + async def 
test_aio_grpc_stream_infer(self): + async def requests_generator(): + yield { + "model_name": self._model_name, + "inputs": self._inputs, + "outputs": self._outputs, + } + + responses_iterator = self._client_aio.stream_infer(requests_generator()) + await asyncio.sleep(2) # ensure the inference has started + self.assertTrue(responses_iterator.cancel()) + with self.assertRaises(asyncio.CancelledError): + async for result, error in responses_iterator: + self._callback(result, error) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_request_cancellation/scheduler_test.py b/qa/L0_request_cancellation/scheduler_test.py new file mode 100755 index 0000000000..a6cd97efaa --- /dev/null +++ b/qa/L0_request_cancellation/scheduler_test.py @@ -0,0 +1,233 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
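For reference, the client-side pattern exercised by the new grpc_cancellation_test.py above is simply: issue the request through the non-blocking API, cancel the handle it returns, and verify that the registered callback receives a "Locally cancelled by application!" error. Below is a minimal sketch of that flow, not part of the patch, assuming a server on localhost:8001 serving the delayed custom_identity_int32 model that the accompanying test.sh builds; the blocking queue.get() replaces the fixed sleeps used by the test.

    import queue
    import time
    from functools import partial

    import numpy as np
    import tritonclient.grpc as grpcclient
    from tritonclient.utils import InferenceServerException

    def callback(responses, result, error):
        # A cancelled request is reported to the callback as an error.
        responses.put(error if error is not None else result)

    client = grpcclient.InferenceServerClient("localhost:8001")
    inputs = [grpcclient.InferInput("INPUT0", [1, 1], "INT32")]
    inputs[0].set_data_from_numpy(np.array([[10]], dtype=np.int32))

    responses = queue.Queue()
    future = client.async_infer(
        model_name="custom_identity_int32",
        inputs=inputs,
        callback=partial(callback, responses),
    )
    time.sleep(2)  # give the (deliberately slow) model time to start executing
    future.cancel()  # request cancellation of the in-flight inference

    error = responses.get()  # blocks until the callback fires with the cancellation
    assert isinstance(error, InferenceServerException)
    assert "Locally cancelled by application!" in str(error)

For streaming requests the equivalent is stop_stream(cancel_requests=True), and with the asyncio client cancelling the asyncio task (or the response iterator) raises asyncio.CancelledError, as the individual test cases above demonstrate.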
+ +import concurrent.futures +import time +import unittest + +import numpy as np +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException + + +class TestScheduler(unittest.TestCase): + def setUp(self): + # Initialize client + self._triton = grpcclient.InferenceServerClient("localhost:8001") + + def _get_inputs(self, batch_size): + self.assertIsInstance(batch_size, int) + self.assertGreater(batch_size, 0) + shape = [batch_size, 8] + inputs = [grpcclient.InferInput("INPUT0", shape, "FP32")] + inputs[0].set_data_from_numpy(np.ones(shape, dtype=np.float32)) + return inputs + + def _generate_callback_and_response_pair(self): + response = {"responded": False, "result": None, "error": None} + + def callback(result, error): + response["responded"] = True + response["result"] = result + response["error"] = error + + return callback, response + + def _assert_response_is_cancelled(self, response): + self.assertTrue(response["responded"]) + self.assertEqual(response["result"], None) + self.assertIsInstance(response["error"], InferenceServerException) + self.assertEqual(response["error"].status(), "StatusCode.CANCELLED") + + def _generate_streaming_callback_and_response_pair(self): + response = [] # [{"result": result, "error": error}, ...] + + def callback(result, error): + response.append({"result": result, "error": error}) + + return callback, response + + def _assert_streaming_response_is_cancelled(self, response): + self.assertGreater(len(response), 0) + cancelled_count = 0 + for res in response: + result, error = res["result"], res["error"] + if error: + self.assertEqual(result, None) + self.assertIsInstance(error, InferenceServerException) + if error.status() == "StatusCode.CANCELLED": + cancelled_count += 1 + self.assertEqual(cancelled_count, 1) + + # Test queued requests on dynamic batch scheduler can be cancelled + def test_dynamic_batch_scheduler_request_cancellation(self): + model_name = "dynamic_batch" + with concurrent.futures.ThreadPoolExecutor() as pool: + # Saturate the 2 batch slots on the model of 1 instance + saturate_thread_1 = pool.submit( + self._triton.infer, model_name, self._get_inputs(batch_size=1) + ) + saturate_thread_2 = pool.submit( + self._triton.infer, model_name, self._get_inputs(batch_size=1) + ) + time.sleep(2) # ensure the slots are filled + # The next request should be queued + callback, response = self._generate_callback_and_response_pair() + queue_future = self._triton.async_infer( + model_name, self._get_inputs(batch_size=1), callback + ) + time.sleep(2) # ensure the request is queued + self.assertFalse(response["responded"]) + # Cancel the queued request + queue_future.cancel() + time.sleep(2) # ensure the cancellation is delivered + self._assert_response_is_cancelled(response) + # Join saturating thread + saturate_thread_1.result() + saturate_thread_2.result() + + # Test backlogged requests on sequence batch scheduler can be cancelled + def test_sequence_batch_scheduler_backlog_request_cancellation(self): + model_name = "sequence_direct" + with concurrent.futures.ThreadPoolExecutor() as pool: + # Saturate the single sequence slot + saturate_thread = pool.submit( + self._triton.infer, + model_name, + self._get_inputs(batch_size=1), + sequence_id=1, + sequence_start=True, + ) + time.sleep(2) # ensure the slot is filled + # The next sequence with 2 requests should be on the backlog + backlog_requests = [] + for i in range(2): + callback, response = self._generate_callback_and_response_pair() + backlog_future = 
self._triton.async_infer( + model_name, + self._get_inputs(batch_size=1), + callback, + sequence_id=2, + sequence_start=(True if i == 0 else False), + ) + backlog_requests.append( + {"future": backlog_future, "response": response} + ) + time.sleep(2) # ensure the sequence is backlogged + self.assertFalse(backlog_requests[0]["response"]["responded"]) + self.assertFalse(backlog_requests[1]["response"]["responded"]) + # Cancelling any backlogged request cancels the entire sequence + backlog_requests[0]["future"].cancel() + time.sleep(2) # ensure the cancellation is delivered + time.sleep(2) # ensure reaper thread has responded + self._assert_response_is_cancelled(backlog_requests[0]["response"]) + self._assert_response_is_cancelled(backlog_requests[1]["response"]) + # Join saturating thread + saturate_thread.result() + + # Test queued requests on direct sequence batch scheduler can be cancelled + def test_direct_sequence_batch_scheduler_request_cancellation(self): + model_name = "sequence_direct" + self._test_sequence_batch_scheduler_queued_request_cancellation(model_name) + + # Test queued requests on oldest sequence batch scheduler can be cancelled + def test_oldest_sequence_batch_scheduler_request_cancellation(self): + model_name = "sequence_oldest" + self._test_sequence_batch_scheduler_queued_request_cancellation(model_name) + + # Helper function + def _test_sequence_batch_scheduler_queued_request_cancellation(self, model_name): + with concurrent.futures.ThreadPoolExecutor() as pool: + # Start the sequence + start_thread = pool.submit( + self._triton.infer, + model_name, + self._get_inputs(batch_size=1), + sequence_id=1, + sequence_start=True, + ) + time.sleep(2) # ensure the sequence has started + # The next 2 requests should be queued + queue_requests = [] + for i in range(2): + callback, response = self._generate_callback_and_response_pair() + queue_future = self._triton.async_infer( + model_name, self._get_inputs(batch_size=1), callback, sequence_id=1 + ) + queue_requests.append({"future": queue_future, "response": response}) + time.sleep(2) # ensure the requests are queued + self.assertFalse(queue_requests[0]["response"]["responded"]) + self.assertFalse(queue_requests[1]["response"]["responded"]) + # Cancelling any queued request cancels the entire sequence + queue_requests[0]["future"].cancel() + time.sleep(2) # ensure the cancellation is delivered + time.sleep(2) # ensure reaper thread has responded + self._assert_response_is_cancelled(queue_requests[0]["response"]) + self._assert_response_is_cancelled(queue_requests[1]["response"]) + # Join start thread + start_thread.result() + + # Test ensemble scheduler will propagate cancellation request to child + def test_ensemble_scheduler_request_cancellation(self): + model_name = "ensemble_model" + callback, response = self._generate_callback_and_response_pair() + infer_future = self._triton.async_infer( + model_name, self._get_inputs(batch_size=1), callback + ) + time.sleep(2) # ensure the inference has started + self.assertFalse(response["responded"]) + infer_future.cancel() + time.sleep(2) # ensure the cancellation is delivered + self._assert_response_is_cancelled(response) + + # Test cancellation on multiple gRPC streaming sequences + def test_scheduler_streaming_request_cancellation(self): + model_name = "sequence_oldest" + # Start 2 sequences with many requests + callback, response = self._generate_streaming_callback_and_response_pair() + self._triton.start_stream(callback) + for sequence_id in [1, 2]: + sequence_start = True + for 
request_id in range(16): + self._triton.async_stream_infer( + model_name, + self._get_inputs(batch_size=1), + sequence_id=sequence_id, + sequence_start=sequence_start, + ) + sequence_start = False + time.sleep(2) # ensure the requests are delivered + # Cancelling the stream cancels all requests on the stream + self._triton.stop_stream(cancel_requests=True) + time.sleep(2) # ensure the cancellation is delivered + time.sleep(2) # ensure reaper thread has responded + self._assert_streaming_response_is_cancelled(response) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_request_cancellation/test.sh b/qa/L0_request_cancellation/test.sh new file mode 100755 index 0000000000..4929be3a5f --- /dev/null +++ b/qa/L0_request_cancellation/test.sh @@ -0,0 +1,183 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +export CUDA_VISIBLE_DEVICES=0 + +SERVER=/opt/tritonserver/bin/tritonserver +source ../common/util.sh + +RET=0 + +# +# Unit tests +# +rm -rf models && mkdir models +mkdir -p models/model/1 && (cd models/model && \ + echo 'name: "model"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 64' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_INT32 \n dims: [ 1000 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_INT32 \n dims: [ 1000 ] }]' >> config.pbtxt && \ + echo 'instance_group [{ kind: KIND_CPU }]' >> config.pbtxt) + +SERVER_LOG=server.log +LD_LIBRARY_PATH=/opt/tritonserver/lib:$LD_LIBRARY_PATH ./request_cancellation_test > $SERVER_LOG +if [ $? 
-ne 0 ]; then + echo -e "\n***\n*** Unit Tests Failed\n***" + cat $SERVER_LOG + RET=1 +fi + +# +# gRPC cancellation tests +# +rm -rf models && mkdir models +mkdir -p models/custom_identity_int32/1 && (cd models/custom_identity_int32 && \ + echo 'name: "custom_identity_int32"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 1024' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_INT32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_INT32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo 'instance_group [{ kind: KIND_CPU }]' >> config.pbtxt && \ + echo -e 'parameters [{ key: "execute_delay_ms" \n value: { string_value: "10000" } }]' >> config.pbtxt) + +for TEST_CASE in "test_grpc_async_infer" "test_grpc_stream_infer" "test_aio_grpc_async_infer" "test_aio_grpc_stream_infer"; do + + TEST_LOG="./grpc_cancellation_test.$TEST_CASE.log" + SERVER_LOG="grpc_cancellation_test.$TEST_CASE.server.log" + + SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + set +e + python grpc_cancellation_test.py GrpcCancellationTest.$TEST_CASE > $TEST_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** gRPC Cancellation Tests Failed on $TEST_CASE\n***" + cat $TEST_LOG + RET=1 + fi + grep "Cancellation notification received for" $SERVER_LOG + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Cancellation not received by server on $TEST_CASE\n***" + cat $SERVER_LOG + RET=1 + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID +done + +# +# End-to-end scheduler tests +# +rm -rf models && mkdir models +mkdir -p models/dynamic_batch/1 && (cd models/dynamic_batch && \ + echo 'name: "dynamic_batch"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 2' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'instance_group [{ count: 1 \n kind: KIND_CPU }]' >> config.pbtxt && \ + echo -e 'dynamic_batching { max_queue_delay_microseconds: 600000 }' >> config.pbtxt && \ + echo -e 'parameters [{ key: "execute_delay_ms" \n value: { string_value: "6000" } }]' >> config.pbtxt) +mkdir -p models/sequence_direct/1 && (cd models/sequence_direct && \ + echo 'name: "sequence_direct"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 1' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'instance_group [{ count: 1 \n kind: KIND_CPU }]' >> config.pbtxt && \ + echo -e 'sequence_batching { direct { } \n max_sequence_idle_microseconds: 6000000 }' >> config.pbtxt && \ + echo -e 'parameters [{ key: "execute_delay_ms" \n value: { string_value: "6000" } }]' >> config.pbtxt) +mkdir -p models/sequence_oldest/1 && (cd models/sequence_oldest && \ + echo 'name: "sequence_oldest"' >> config.pbtxt && \ + echo 'backend: "identity"' >> config.pbtxt && \ + echo 'max_batch_size: 1' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_FP32 
\n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'instance_group [{ count: 1 \n kind: KIND_CPU }]' >> config.pbtxt && \ + echo -e 'sequence_batching { oldest { max_candidate_sequences: 1 } \n max_sequence_idle_microseconds: 6000000 }' >> config.pbtxt && \ + echo -e 'parameters [{ key: "execute_delay_ms" \n value: { string_value: "6000" } }]' >> config.pbtxt) +mkdir -p models/ensemble_model/1 && (cd models/ensemble_model && \ + echo 'name: "ensemble_model"' >> config.pbtxt && \ + echo 'platform: "ensemble"' >> config.pbtxt && \ + echo 'max_batch_size: 1' >> config.pbtxt && \ + echo -e 'input [{ name: "INPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo -e 'output [{ name: "OUTPUT0" \n data_type: TYPE_FP32 \n dims: [ -1 ] }]' >> config.pbtxt && \ + echo 'ensemble_scheduling { step [' >> config.pbtxt && \ + echo -e '{ model_name: "dynamic_batch" \n model_version: -1 \n input_map { key: "INPUT0" \n value: "INPUT0" } \n output_map { key: "OUTPUT0" \n value: "out" } },' >> config.pbtxt && \ + echo -e '{ model_name: "dynamic_batch" \n model_version: -1 \n input_map { key: "INPUT0" \n value: "out" } \n output_map { key: "OUTPUT0" \n value: "OUTPUT0" } }' >> config.pbtxt && \ + echo '] }' >> config.pbtxt) + +TEST_LOG="scheduler_test.log" +SERVER_LOG="./scheduler_test.server.log" + +SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=2" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python scheduler_test.py > $TEST_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Scheduler Tests Failed\n***" + cat $TEST_LOG + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi +exit $RET diff --git a/qa/L0_response_cache/models/decoupled_cache/config.pbtxt b/qa/L0_response_cache/models/decoupled_cache/config.pbtxt new file mode 100644 index 0000000000..c243e72861 --- /dev/null +++ b/qa/L0_response_cache/models/decoupled_cache/config.pbtxt @@ -0,0 +1,49 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +model_transaction_policy { + decoupled: True +} +response_cache { + enable: True +} diff --git a/qa/L0_response_cache/models/identity_cache/config.pbtxt b/qa/L0_response_cache/models/identity_cache/config.pbtxt new file mode 100644 index 0000000000..7ba5cf2afb --- /dev/null +++ b/qa/L0_response_cache/models/identity_cache/config.pbtxt @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 0 +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +response_cache { + enable: True +} diff --git a/qa/L0_response_cache/test.sh b/qa/L0_response_cache/test.sh index c13858226d..434195b693 100755 --- a/qa/L0_response_cache/test.sh +++ b/qa/L0_response_cache/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,19 +29,254 @@ RET=0 TEST_LOG="./response_cache_test.log" UNIT_TEST=./response_cache_test +export CUDA_VISIBLE_DEVICES=0 + +# Only localhost supported in this test for now, but in future could make +# use of a persistent remote redis server, or similarly use --replicaof arg. +export TRITON_REDIS_HOST="localhost" +export TRITON_REDIS_PORT="6379" +REDIS_LOG="./redis-server.unit_tests.log" rm -fr *.log +function install_redis() { + ## Install redis if not already installed + if ! command -v redis-server >/dev/null 2>&1; then + apt update -y && apt install -y redis + fi +} + +function start_redis() { + # Run redis server in background + redis-server \ + --daemonize yes \ + --port "${TRITON_REDIS_PORT}" \ + --logfile "${REDIS_LOG}" \ + --loglevel debug + + # Check redis server is running + REDIS_PING_RESPONSE=$(redis-cli -h ${TRITON_REDIS_HOST} -p ${TRITON_REDIS_PORT} ping) + if [ "${REDIS_PING_RESPONSE}" == "PONG" ]; then + echo "Redis successfully started in background" + else + echo -e "\n***\n*** Failed: Redis server did not start successfully\n***" + RET=1 + fi +} + +function stop_redis() { + echo "Stopping Redis server..." + redis-cli -h "${TRITON_REDIS_HOST}" -p "${TRITON_REDIS_PORT}" shutdown || true + echo "Redis server shutdown" +} + +function set_redis_auth() { + # NOTE: Per-user auth [Access Control List (ACL)] is only supported in + # Redis >= 6.0 and is more comprehensive in what can be configured. + # For simplicity and wider range of Redis version support, use + # server-wide password via "requirepass" for now. + redis-cli -h "${TRITON_REDIS_HOST}" -p "${TRITON_REDIS_PORT}" config set requirepass "${REDIS_PW}" + export REDISCLI_AUTH="${REDIS_PW}" +} + +function unset_redis_auth() { + # Authenticate implicitly via REDISCLI_AUTH env var, then unset password/var + redis-cli -h "${TRITON_REDIS_HOST}" -p "${TRITON_REDIS_PORT}" config set requirepass "" + unset REDISCLI_AUTH +} + +# UNIT TESTS set +e -export CUDA_VISIBLE_DEVICES=0 + +## Unit tests currently run for both Local and Redis cache implementations +## by default. However, we could break out the unit tests for each +## into separate runs gtest filters if needed in the future: +## - `${UNIT_TEST} --gtest_filter=*Local*` +## - `${UNIT_TEST} --gtest_filter=*Redis*` +install_redis +# Stop any existing redis server first for good measure +stop_redis +start_redis LD_LIBRARY_PATH=/opt/tritonserver/lib:$LD_LIBRARY_PATH $UNIT_TEST >>$TEST_LOG 2>&1 if [ $? -ne 0 ]; then cat $TEST_LOG echo -e "\n***\n*** Response Cache Unit Test Failed\n***" RET=1 fi +stop_redis set -e +# SERVER TESTS +function check_server_success_and_kill { + if [ "${SERVER_PID}" == "0" ]; then + echo -e "\n***\n*** Failed to start ${SERVER}\n***" + cat ${SERVER_LOG} + RET=1 + else + kill ${SERVER_PID} + wait ${SERVER_PID} + fi +} + +function check_server_expected_failure { + EXPECTED_MESSAGE="${1}" + if [ "${SERVER_PID}" != "0" ]; then + echo -e "\n***\n*** Failed: ${SERVER} started successfully when it was expected to fail\n***" + cat ${SERVER_LOG} + RET=1 + + kill ${SERVER_PID} + wait ${SERVER_PID} + else + # Check that server fails with the correct error message + set +e + grep -i "${EXPECTED_MESSAGE}" ${SERVER_LOG} + if [ $? 
-ne 0 ]; then + echo -e "\n***\n*** Failed: Expected [${EXPECTED_MESSAGE}] error message in output\n***" + cat $SERVER_LOG + RET=1 + fi + set -e + fi +} + +MODEL_DIR="${PWD}/models" +mkdir -p "${MODEL_DIR}/decoupled_cache/1" +mkdir -p "${MODEL_DIR}/identity_cache/1" + +# Check that server fails to start for a "decoupled" model with cache enabled +EXTRA_ARGS="--model-control-mode=explicit --load-model=decoupled_cache" + +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=${MODEL_DIR} --response-cache-byte-size=8192 ${EXTRA_ARGS}" +SERVER_LOG="./inference_server.log" +source ../common/util.sh +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Failed: $SERVER started successfully when it was expected to fail\n***" + cat $SERVER_LOG + RET=1 + + kill $SERVER_PID + wait $SERVER_PID +else + # Check that server fails with the correct error message + set +e + grep -i "response cache does not currently support" ${SERVER_LOG} | grep -i "decoupled" + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed: Expected response cache / decoupled mode error message in output\n***" + cat $SERVER_LOG + RET=1 + fi + set -e +fi + +# Test with model expected to load successfully +EXTRA_ARGS="--model-control-mode=explicit --load-model=identity_cache" + +# Test old cache config method +# --response-cache-byte-size must be non-zero to test models with cache enabled +SERVER_ARGS="--model-repository=${MODEL_DIR} --response-cache-byte-size=8192 ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test new cache config method +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=local,size=8192 ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test that specifying multiple cache types is not supported and should fail +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=local,size=8192 --cache-config=redis,key=value ${EXTRA_ARGS}" +run_server +check_server_expected_failure "multiple cache configurations" + +# Test that specifying both config styles is incompatible and should fail +SERVER_ARGS="--model-repository=${MODEL_DIR} --response-cache-byte-size=12345 --cache-config=local,size=67890 ${EXTRA_ARGS}" +run_server +check_server_expected_failure "incompatible flags" + +## Redis Cache CLI tests +REDIS_ENDPOINT="--cache-config redis,host=${TRITON_REDIS_HOST} --cache-config redis,port=${TRITON_REDIS_PORT}" +REDIS_LOG="./redis-server.cli_tests.log" +start_redis + +# Test simple redis cache config succeeds +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test triton fails to initialize if it can't connect to redis cache +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=redis,host=localhost --cache-config=redis,port=nonexistent ${EXTRA_ARGS}" +run_server +check_server_expected_failure "Failed to connect to Redis: Connection refused" + +# Test triton fails to initialize if it can't resolve host for redis cache +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=redis,host=nonexistent --cache-config=redis,port=nonexistent ${EXTRA_ARGS}" +run_server +# Either of these errors can be returned for bad hostname, so check for either. 
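These Redis-backed cache cases assume a local redis-server reachable at TRITON_REDIS_HOST:TRITON_REDIS_PORT; the start_redis helper above verifies that with redis-cli ping. Purely as an illustration (the test itself does not use it), the same health check could be written in Python with the redis-py package, which is an assumption here rather than a dependency of this test:

    import os

    import redis  # redis-py; assumed available for this sketch only

    host = os.getenv("TRITON_REDIS_HOST", "localhost")
    port = int(os.getenv("TRITON_REDIS_PORT", "6379"))

    # Equivalent of: redis-cli -h $TRITON_REDIS_HOST -p $TRITON_REDIS_PORT ping
    if not redis.Redis(host=host, port=port).ping():
        raise RuntimeError("Redis server did not respond to PING")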
+MSG1="Temporary failure in name resolution" +MSG2="Name or service not known" +check_server_expected_failure "${MSG1}\|${MSG2}" + +# Test triton fails to initialize if minimum required args (host & port) not all provided +SERVER_ARGS="--model-repository=${MODEL_DIR} --cache-config=redis,port=${TRITON_REDIS_HOST} ${EXTRA_ARGS}" +run_server +check_server_expected_failure "Must at a minimum specify" + +## Redis Authentication tests + +# Automatically provide auth via REDISCLI_AUTH env var when set: https://redis.io/docs/ui/cli/ +REDIS_PW="redis123!" +set_redis_auth + +### Credentials via command-line + +# Test simple redis authentication succeeds with correct credentials +REDIS_CACHE_AUTH="--cache-config redis,password=${REDIS_PW}" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${REDIS_CACHE_AUTH} ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test simple redis authentication fails with wrong credentials +REDIS_CACHE_AUTH="--cache-config redis,password=wrong" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${REDIS_CACHE_AUTH} ${EXTRA_ARGS}" +run_server +check_server_expected_failure "WRONGPASS" + +# Test simple redis authentication fails with no credentials +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_expected_failure "NOAUTH Authentication required" + +### Credentials via environment variables + +# Test simple redis authentication succeeds with password-only via env vars +# No username means use "default" as the username +unset TRITONCACHE_REDIS_USERNAME +export TRITONCACHE_REDIS_PASSWORD="${REDIS_PW}" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test simple redis authentication succeeds with correct user and password via env vars +export TRITONCACHE_REDIS_USERNAME="default" +export TRITONCACHE_REDIS_PASSWORD="${REDIS_PW}" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_success_and_kill + +# Test simple redis authentication fails with wrong credentials via env vars +export TRITONCACHE_REDIS_PASSWORD="wrong" +SERVER_ARGS="--model-repository=${MODEL_DIR} ${REDIS_ENDPOINT} ${EXTRA_ARGS}" +run_server +check_server_expected_failure "WRONGPASS" +unset TRITONCACHE_REDIS_USERNAME +unset TRITONCACHE_REDIS_PASSWORD + +# Clean up redis server before exiting test +unset_redis_auth +stop_redis + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else diff --git a/qa/L0_sagemaker/sagemaker_multi_model_test.py b/qa/L0_sagemaker/sagemaker_multi_model_test.py old mode 100644 new mode 100755 index 820562c1da..b2052f6751 --- a/qa/L0_sagemaker/sagemaker_multi_model_test.py +++ b/qa/L0_sagemaker/sagemaker_multi_model_test.py @@ -1,5 +1,5 @@ #!/usr/bin/python -# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,27 +29,20 @@ sys.path.append("../common") +import json import os -import shutil +import sys import time import unittest + import numpy as np -import infer_util as iu +import requests import test_util as tu import tritonclient.http as httpclient -import argparse -import csv -import json -import os -import requests -import socket -import sys - class SageMakerMultiModelTest(tu.TestResultCollector): def setUp(self): - SAGEMAKER_BIND_TO_PORT = os.getenv("SAGEMAKER_BIND_TO_PORT", "8080") self.url_mme_ = "http://localhost:{}/models".format(SAGEMAKER_BIND_TO_PORT) @@ -58,15 +51,59 @@ def setUp(self): self.model1_url = "/opt/ml/models/123456789abcdefghi/model" self.model1_input_data_ = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] - self.model1_expected_output0_data_ = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30] - self.model1_expected_output1_data_ = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] + self.model1_expected_output0_data_ = [ + 0, + 2, + 4, + 6, + 8, + 10, + 12, + 14, + 16, + 18, + 20, + 22, + 24, + 26, + 28, + 30, + ] + self.model1_expected_output1_data_ = [ + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ] self.model1_expected_result_ = { "model_name": "sm_mme_model_1", "model_version": "1", "outputs": [ - {"name": "OUTPUT0", "datatype": "INT32", "shape": [1, 16], "data": self.model1_expected_output0_data_}, - {"name": "OUTPUT1", "datatype": "INT32", "shape": [1, 16], "data": self.model1_expected_output1_data_}, + { + "name": "OUTPUT0", + "datatype": "INT32", + "shape": [1, 16], + "data": self.model1_expected_output0_data_, + }, + { + "name": "OUTPUT1", + "datatype": "INT32", + "shape": [1, 16], + "data": self.model1_expected_output1_data_, + }, ], } @@ -77,9 +114,15 @@ def setUp(self): # Output is same as input since this is an identity model self.model2_input_data_ = [0, 1, 2, 3, 4, 5, 6, 7] + # ensemble model setup + self.model3_name = "123456789ensemble" + self.model3_url = "/opt/ml/models/123456789ensemble/model" + def test_sm_0_environment_variables_set(self): self.assertEqual( - os.getenv("SAGEMAKER_MULTI_MODEL"), "true", "Variable SAGEMAKER_MULTI_MODEL must be set to true" + os.getenv("SAGEMAKER_MULTI_MODEL"), + "true", + "Variable SAGEMAKER_MULTI_MODEL must be set to true", ) def test_sm_1_model_load(self): @@ -88,35 +131,59 @@ def test_sm_1_model_load(self): headers = {"Content-Type": "application/json"} r = requests.post(self.url_mme_, data=json.dumps(request_body), headers=headers) time.sleep(5) # wait for model to load - self.assertEqual(r.status_code, 200, "Expected status code 200, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) # Load the same model again, expect a 409 request_body = {"model_name": self.model1_name, "url": self.model1_url} headers = {"Content-Type": "application/json"} r = requests.post(self.url_mme_, data=json.dumps(request_body), headers=headers) time.sleep(5) # wait for model to load - self.assertEqual(r.status_code, 409, "Expected status code 409, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 409, + "Expected status code 409, received {}".format(r.status_code), + ) # Load model_2 request_body = {"model_name": self.model2_name, "url": self.model2_url} headers = {"Content-Type": "application/json"} r = 
requests.post(self.url_mme_, data=json.dumps(request_body), headers=headers) time.sleep(5) # wait for model to load - self.assertEqual(r.status_code, 200, "Expected status code 200, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) def test_sm_2_model_list(self): r = requests.get(self.url_mme_) time.sleep(3) expected_response_1 = { "models": [ - {"modelName": self.model1_name, "modelUrl": self.model1_url}, - {"modelName": self.model2_name, "modelUrl": self.model2_url}, + { + "modelName": self.model1_name, + "modelUrl": self.model1_url.rstrip("/model"), + }, + { + "modelName": self.model2_name, + "modelUrl": self.model2_url.rstrip("/model"), + }, ] } expected_response_2 = { "models": [ - {"modelName": self.model2_name, "modelUrl": self.model2_url}, - {"modelName": self.model1_name, "modelUrl": self.model1_url}, + { + "modelName": self.model2_name, + "modelUrl": self.model2_url.rstrip("/model"), + }, + { + "modelName": self.model1_name, + "modelUrl": self.model1_url.rstrip("/model"), + }, ] } @@ -124,16 +191,23 @@ def test_sm_2_model_list(self): self.assertIn( r.json(), [expected_response_1, expected_response_2], - "Expected one of {}, received: {}".format([expected_response_1, expected_response_2], r.json()), + "Expected one of {}, received: {}".format( + [expected_response_1, expected_response_2], r.json() + ), ) def test_sm_3_model_get(self): get_url = "{}/{}".format(self.url_mme_, self.model1_name) r = requests.get(get_url) time.sleep(3) - expected_response = {"modelName": self.model1_name, "modelUrl": self.model1_url} + expected_response = { + "modelName": self.model1_name, + "modelUrl": self.model1_url.rstrip("/model"), + } self.assertEqual( - r.json(), expected_response, "Expected response: {}, received: {}".format(expected_response, r.json()) + r.json(), + expected_response, + "Expected response: {}, received: {}".format(expected_response, r.json()), ) def test_sm_4_model_invoke(self): @@ -151,7 +225,9 @@ def test_sm_4_model_invoke(self): outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) - request_body, _ = httpclient.InferenceServerClient.generate_request_body(inputs, outputs=outputs) + request_body, _ = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = {"Content-Type": "application/json"} invoke_url = "{}/{}/invoke".format(self.url_mme_, self.model1_name) @@ -161,7 +237,9 @@ def test_sm_4_model_invoke(self): self.assertEqual( self.model1_expected_result_, r.json(), - "Expected response : {}, received: {}".format(self.model1_expected_result_, r.json()), + "Expected response : {}, received: {}".format( + self.model1_expected_result_, r.json() + ), ) # Invoke model_2 @@ -180,42 +258,121 @@ def test_sm_4_model_invoke(self): outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body(inputs, outputs=outputs) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) invoke_url = "{}/{}/invoke".format(self.url_mme_, self.model2_name) headers = { - "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(header_length) + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size={}".format( + header_length + ) } r = 
requests.post(invoke_url, data=request_body, headers=headers) - header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size=" + header_length_prefix = ( + "application/vnd.sagemaker-triton.binary+json;json-header-size=" + ) header_length_str = r.headers["Content-Type"][len(header_length_prefix) :] - result = httpclient.InferenceServerClient.parse_response_body(r._content, header_length=int(header_length_str)) + result = httpclient.InferenceServerClient.parse_response_body( + r._content, header_length=int(header_length_str) + ) # Get the inference header size so we can locate the output binary data output_data = result.as_numpy("OUTPUT0") - + for i in range(8): - self.assertEqual(output_data[0][i], input_data[0][i], "Tensor Value Mismatch") + self.assertEqual( + output_data[0][i], input_data[0][i], "Tensor Value Mismatch" + ) def test_sm_5_model_unload(self): # Unload model_1 unload_url = "{}/{}".format(self.url_mme_, self.model1_name) r = requests.delete(unload_url) time.sleep(3) - self.assertEqual(r.status_code, 200, "Expected status code 200, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) # Unload model_2 unload_url = "{}/{}".format(self.url_mme_, self.model2_name) r = requests.delete(unload_url) time.sleep(3) - self.assertEqual(r.status_code, 200, "Expected status code 200, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) # Unload a non-loaded model, expect a 404 unload_url = "{}/sm_non_loaded_model".format(self.url_mme_) r = requests.delete(unload_url) time.sleep(3) - self.assertEqual(r.status_code, 404, "Expected status code 404, received {}".format(r.status_code)) + self.assertEqual( + r.status_code, + 404, + "Expected status code 404, received {}".format(r.status_code), + ) + + def test_sm_6_ensemble_model(self): + # Load ensemble model + request_body = {"model_name": self.model3_name, "url": self.model3_url} + headers = { + "Content-Type": "application/json", + "X-Amzn-SageMaker-Target-Model": f"{self.model3_name}", + } + r = requests.post(self.url_mme_, data=json.dumps(request_body), headers=headers) + time.sleep(5) # wait for model to load + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) + + # Invoke ensemble model + inputs = [] + outputs = [] + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "FP32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "FP32")) + + # Initialize the data + input_data = np.array(self.model1_input_data_, dtype=np.float32) + input_data = np.expand_dims(input_data, axis=0) + inputs[0].set_data_from_numpy(input_data, binary_data=False) + inputs[1].set_data_from_numpy(input_data, binary_data=False) + + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + request_body, _ = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) + + headers = {"Content-Type": "application/json"} + invoke_url = "{}/{}/invoke".format(self.url_mme_, self.model3_name) + r = requests.post(invoke_url, data=request_body, headers=headers) + print(f"response: {r.text}") + r.raise_for_status() + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) + + # Unload ensemble model + unload_url = 
"{}/{}".format(self.url_mme_, self.model3_name) + r = requests.delete(unload_url, headers=headers) + time.sleep(5) + self.assertEqual( + r.status_code, + 200, + "Expected status code 200, received {}".format(r.status_code), + ) if __name__ == "__main__": diff --git a/qa/L0_sagemaker/sagemaker_test.py b/qa/L0_sagemaker/sagemaker_test.py old mode 100644 new mode 100755 index baff8b5528..6e76a9f0fd --- a/qa/L0_sagemaker/sagemaker_test.py +++ b/qa/L0_sagemaker/sagemaker_test.py @@ -1,5 +1,5 @@ #!/usr/bin/python -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,88 +26,98 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") +import json import os -import shutil -import time +import sys import unittest + import numpy as np -import infer_util as iu +import requests import test_util as tu import tritonclient.http as httpclient -import argparse -import csv -import json -import os -import requests -import socket -import sys - class SageMakerTest(tu.TestResultCollector): - def setUp(self): - SAGEMAKER_BIND_TO_PORT = os.getenv('SAGEMAKER_BIND_TO_PORT', '8080') - self.url_ = "http://localhost:{}/invocations".format( - SAGEMAKER_BIND_TO_PORT) - self.input_data_ = [ - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 - ] + SAGEMAKER_BIND_TO_PORT = os.getenv("SAGEMAKER_BIND_TO_PORT", "8080") + self.url_ = "http://localhost:{}/invocations".format(SAGEMAKER_BIND_TO_PORT) + self.input_data_ = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] self.expected_output0_data_ = [ - 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 - ] - self.expected_output1_data_ = [ - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 + 0, + 2, + 4, + 6, + 8, + 10, + 12, + 14, + 16, + 18, + 20, + 22, + 24, + 26, + 28, + 30, ] + self.expected_output1_data_ = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] self.expected_result_ = { - "model_name": - "sm_model", - "model_version": - "1", - "outputs": [{ - "name": "OUTPUT0", - "datatype": "INT32", - "shape": [1, 16], - "data": self.expected_output0_data_ - }, { - "name": "OUTPUT1", - "datatype": "INT32", - "shape": [1, 16], - "data": self.expected_output1_data_ - }] + "model_name": "sm_model", + "model_version": "1", + "outputs": [ + { + "name": "OUTPUT0", + "datatype": "INT32", + "shape": [1, 16], + "data": self.expected_output0_data_, + }, + { + "name": "OUTPUT1", + "datatype": "INT32", + "shape": [1, 16], + "data": self.expected_output1_data_, + }, + ], } def test_direct_inference(self): request = { - "inputs": [{ - "name": "INPUT0", - "datatype": "INT32", - "shape": [1, 16], - "data": self.input_data_ - }, { - "name": "INPUT1", - "datatype": "INT32", - "shape": [1, 16], - "data": self.input_data_ - }] + "inputs": [ + { + "name": "INPUT0", + "datatype": "INT32", + "shape": [1, 16], + "data": self.input_data_, + }, + { + "name": "INPUT1", + "datatype": "INT32", + "shape": [1, 16], + "data": self.input_data_, + }, + ] } - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=json.dumps(request), headers=headers) r.raise_for_status() self.assertEqual( - self.expected_result_, r.json(), + self.expected_result_, + r.json(), "Expected response body: {}; got: {}".format( - self.expected_result_, 
r.json())) + self.expected_result_, r.json() + ), + ) def test_inference_client_generated_request(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -115,27 +125,29 @@ def test_inference_client_generated_request(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() self.assertEqual( - self.expected_result_, r.json(), + self.expected_result_, + r.json(), "Expected response body: {}; got: {}".format( - self.expected_result_, r.json())) + self.expected_result_, r.json() + ), + ) def test_inference_client_generated_request_binary(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -143,31 +155,36 @@ def test_inference_client_generated_request_binary(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.sagemaker-triton.binary+json;json-header-size={}' - .format(header_length) + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size={}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() self.assertEqual( - self.expected_result_, r.json(), + self.expected_result_, + r.json(), "Expected response body: {}; got: {}".format( - self.expected_result_, r.json())) + self.expected_result_, r.json() + ), + ) def test_inference_client_generated_response(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + 
inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -175,22 +192,20 @@ def test_inference_client_generated_response(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - result = httpclient.InferenceServerClient.parse_response_body( - r._content) + result = httpclient.InferenceServerClient.parse_response_body(r._content) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -198,8 +213,8 @@ def test_inference_client_generated_response(self): def test_inference_client_generated_response_binary(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -207,25 +222,26 @@ def test_inference_client_generated_response_binary(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size=" - header_length_str = r.headers['Content-Type'][len(header_length_prefix - ):] + header_length_prefix = ( + "application/vnd.sagemaker-triton.binary+json;json-header-size=" + ) + header_length_str = r.headers["Content-Type"][len(header_length_prefix) :] result = httpclient.InferenceServerClient.parse_response_body( - r._content, header_length=int(header_length_str)) + r._content, header_length=int(header_length_str) + ) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): 
self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -233,8 +249,8 @@ def test_inference_client_generated_response_binary(self): def test_malformed_binary_header(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -242,29 +258,34 @@ def test_malformed_binary_header(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'additional-string/application/vnd.sagemaker-triton.binary+json;json-header-size={}' - .format(header_length) + "Content-Type": "additional-string/application/vnd.sagemaker-triton.binary+json;json-header-size={}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_not_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -272,29 +293,34 @@ def test_malformed_binary_header_not_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.sagemaker-triton.binary+json;json-header-size=additional-string{}' - .format(header_length) + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size=additional-string{}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: 
{}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_negative_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -302,28 +328,32 @@ def test_malformed_binary_header_negative_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.sagemaker-triton.binary+json;json-header-size=-123' + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size=-123" } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_large_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -331,23 +361,27 @@ def test_malformed_binary_header_large_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.sagemaker-triton.binary+json;json-header-size=12345' + "Content-Type": "application/vnd.sagemaker-triton.binary+json;json-header-size=12345" } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_sagemaker/test.sh b/qa/L0_sagemaker/test.sh index e701e8dd71..b5bd07c519 100755 --- a/qa/L0_sagemaker/test.sh +++ 
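# A minimal sketch of the convention exercised by the SageMaker tests above:
# the request's Content-Type declares how many leading bytes of the body form
# the JSON header ("application/vnd.sagemaker-triton.binary+json;
# json-header-size=<N>"), and the response is parsed back with the same
# offset. The helper names below are illustrative only; generate_request_body()
# and parse_response_body() are the tritonclient calls the tests themselves use.
import tritonclient.http as httpclient

_BINARY_JSON = "application/vnd.sagemaker-triton.binary+json;json-header-size="

def sagemaker_binary_headers(header_length):
    # Request side: announce the JSON header size returned by
    # httpclient.InferenceServerClient.generate_request_body().
    return {"Content-Type": _BINARY_JSON + str(header_length)}

def parse_sagemaker_binary_response(response):
    # Response side: recover the header size from Content-Type, then let the
    # Triton HTTP client split the JSON header from the trailing binary data.
    header_length = int(response.headers["Content-Type"][len(_BINARY_JSON):])
    return httpclient.InferenceServerClient.parse_response_body(
        response.content, header_length=header_length
    )

# The malformed-header tests above then verify that a Content-Type which does
# not match this exact prefix, or which encodes a negative or oversized header
# length, is rejected with HTTP 400.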
b/qa/L0_sagemaker/test.sh @@ -56,11 +56,12 @@ rm -f *.out SAGEMAKER_TEST=sagemaker_test.py SAGEMAKER_MULTI_MODEL_TEST=sagemaker_multi_model_test.py -MULTI_MODEL_UNIT_TEST_COUNT=6 +MULTI_MODEL_UNIT_TEST_COUNT=7 UNIT_TEST_COUNT=9 CLIENT_LOG="./client.log" DATADIR=/data/inferenceserver/${REPO_VERSION} +ENSEMBLEDIR=/data/inferenceserver/${REPO_VERSION}/qa_ensemble_model_repository/qa_model_repository SERVER=/opt/tritonserver/bin/tritonserver SERVER_LOG="./server.log" # Link model repository to "/opt/ml/model" @@ -352,7 +353,7 @@ if [ "$SERVER_PID" == "0" ]; then exit 1 fi -# Ping and expect error code +# Ping and expect error code in SME mode. set +e code=`curl -s -w %{http_code} -o ./ping.out localhost:8080/ping` set -e @@ -382,6 +383,33 @@ cp -r $DATADIR/qa_model_repository/onnx_int32_int32_int32/* ${MODEL1_PATH} && \ cp -r $DATADIR/qa_identity_model_repository/onnx_zero_1_float32/* ${MODEL2_PATH} && \ sed -i "s/onnx_zero_1_float32/sm_mme_model_2/" ${MODEL2_PATH}/config.pbtxt +# Ensemble model +ENSEMBLE_MODEL_PATH="models/123456789ensemble/model" +mkdir -p "${ENSEMBLE_MODEL_PATH}" + +model_name=python_float32_float32_float32 + +mkdir -p ${ENSEMBLE_MODEL_PATH}/${model_name}/1 && \ +cp ../python_models/add_sub/model.py ${ENSEMBLE_MODEL_PATH}/${model_name}/1/. && \ +cp ../python_models/add_sub/config.pbtxt ${ENSEMBLE_MODEL_PATH}/${model_name}/. +(cd ${ENSEMBLE_MODEL_PATH}/${model_name} && \ + sed -i "s/label_filename:.*//" config.pbtxt && \ + echo "max_batch_size: 64" >> config.pbtxt) + +# Ensemble part +mkdir -p ${ENSEMBLE_MODEL_PATH}/fan_${model_name}/1 && \ + cp ../python_models/add_sub/model.py ${ENSEMBLE_MODEL_PATH}/fan_${model_name}/1/. && \ + cp ../python_models/fan_add_sub/config.pbtxt ${ENSEMBLE_MODEL_PATH}/fan_${model_name}/. && \ + (cd ${ENSEMBLE_MODEL_PATH}/fan_${model_name} && \ + sed -i "s/label_filename:.*//" config.pbtxt && \ + sed -i "s/model_name: \"ENSEMBLE_MODEL_NAME\"/model_name: \"${model_name}\"/" config.pbtxt && \ + sed -i "0,/name:.*/{s/name:.*/name: \"fan_${model_name}\"/}" config.pbtxt && \ + echo "max_batch_size: 64" >> config.pbtxt) + +# # custom float32 component of ensemble +cp -r $ENSEMBLEDIR/nop_TYPE_FP32_-1 ${ENSEMBLE_MODEL_PATH}/. && \ + mkdir -p ${ENSEMBLE_MODEL_PATH}/nop_TYPE_FP32_-1/1 + # Start server with 'serve' script export SAGEMAKER_MULTI_MODEL=true export SAGEMAKER_TRITON_LOG_VERBOSE=true @@ -423,10 +451,8 @@ rm -rf /opt/ml/models kill $SERVER_PID wait $SERVE_PID - # MME end - unlink /opt/ml/model rm -rf /opt/ml/model diff --git a/qa/L0_savedmodel_shape/saved_model_shape_test.py b/qa/L0_savedmodel_shape/saved_model_shape_test.py old mode 100644 new mode 100755 index c1c74c97a7..b5ae13a680 --- a/qa/L0_savedmodel_shape/saved_model_shape_test.py +++ b/qa/L0_savedmodel_shape/saved_model_shape_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2018-2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,198 +27,202 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu -import os np_dtype_string = np.dtype(object) class SavedModelShapeTest(tu.TestResultCollector): - - def _full_exact(self, input_dtype, output0_dtype, output1_dtype, - output0_raw, output1_raw, swap): - - def _infer_exact_helper(tester, - pf, - tensor_shape, - batch_size, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=True, - output1_raw=True, - model_version=None, - swap=False, - outputs=("OUTPUT0", "OUTPUT1"), - use_http=True, - use_grpc=True, - skip_request_id_check=False, - use_streaming=True, - correlation_id=0): + def _full_exact( + self, input_dtype, output0_dtype, output1_dtype, output0_raw, output1_raw, swap + ): + def _infer_exact_helper( + tester, + pf, + tensor_shape, + batch_size, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=True, + output1_raw=True, + model_version=None, + swap=False, + outputs=("OUTPUT0", "OUTPUT1"), + use_http=True, + use_grpc=True, + skip_request_id_check=False, + use_streaming=True, + correlation_id=0, + ): for bs in (1, batch_size): # model that does not support batching if bs == 1: - iu.infer_exact(tester, - "savedmodel_nobatch", - tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) + iu.infer_exact( + tester, + "savedmodel_nobatch", + tensor_shape, + bs, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + model_version=model_version, + swap=swap, + outputs=outputs, + use_http=use_http, + use_grpc=use_grpc, + skip_request_id_check=skip_request_id_check, + use_streaming=use_streaming, + correlation_id=correlation_id, + ) # model that supports batching - iu.infer_exact(tester, - "savedmodel", (bs,) + tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) + iu.infer_exact( + tester, + "savedmodel", + (bs,) + tensor_shape, + bs, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + model_version=model_version, + swap=swap, + outputs=outputs, + use_http=use_http, + use_grpc=use_grpc, + skip_request_id_check=skip_request_id_check, + use_streaming=use_streaming, + correlation_id=correlation_id, + ) input_size = 16 - if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - _infer_exact_helper(self, - "savedmodel", (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) + if tu.validate_for_tf_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size,), + (input_size,), + (input_size,), + ): + _infer_exact_helper( + self, + "savedmodel", + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) def test_raw_bbb(self): - 
self._full_exact(np.int8, - np.int8, - np.int8, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.int8, np.int8, np.int8, output0_raw=True, output1_raw=True, swap=True + ) def test_raw_sss(self): - self._full_exact(np.int16, - np.int16, - np.int16, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.int16, np.int16, np.int16, output0_raw=True, output1_raw=True, swap=True + ) def test_raw_iii(self): - self._full_exact(np.int32, - np.int32, - np.int32, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.int32, np.int32, np.int32, output0_raw=True, output1_raw=True, swap=True + ) def test_raw_lll(self): - self._full_exact(np.int64, - np.int64, - np.int64, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int64, np.int64, np.int64, output0_raw=True, output1_raw=True, swap=False + ) def test_raw_hhh(self): - self._full_exact(np.float16, - np.float16, - np.float16, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.float16, + np.float16, + np.float16, + output0_raw=True, + output1_raw=True, + swap=False, + ) def test_raw_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.float32, + np.float32, + np.float32, + output0_raw=True, + output1_raw=True, + swap=True, + ) def test_raw_hff(self): - self._full_exact(np.float16, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.float16, + np.float32, + np.float32, + output0_raw=True, + output1_raw=True, + swap=False, + ) def test_raw_bii(self): - self._full_exact(np.int8, - np.int32, - np.int32, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int8, np.int32, np.int32, output0_raw=True, output1_raw=True, swap=False + ) def test_raw_ibb(self): - self._full_exact(np.int32, - np.int8, - np.int8, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int32, np.int8, np.int8, output0_raw=True, output1_raw=True, swap=False + ) def test_raw_ibs(self): - self._full_exact(np.int32, - np.int8, - np.int16, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int32, np.int8, np.int16, output0_raw=True, output1_raw=True, swap=False + ) def test_raw_iff(self): - self._full_exact(np.int32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int32, + np.float32, + np.float32, + output0_raw=True, + output1_raw=True, + swap=False, + ) def test_raw_fii(self): - self._full_exact(np.float32, - np.int32, - np.int32, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.float32, + np.int32, + np.int32, + output0_raw=True, + output1_raw=True, + swap=False, + ) def test_raw_ihs(self): - self._full_exact(np.int32, - np.float16, - np.int16, - output0_raw=True, - output1_raw=True, - swap=False) + self._full_exact( + np.int32, + np.float16, + np.int16, + output0_raw=True, + output1_raw=True, + swap=False, + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_savedmodel_shape/test.sh b/qa/L0_savedmodel_shape/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_scalar_io/scalar_test.py b/qa/L0_scalar_io/scalar_test.py new file mode 100755 index 0000000000..16aa1136ca --- /dev/null +++ b/qa/L0_scalar_io/scalar_test.py @@ -0,0 +1,71 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & 
AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import os +import unittest + +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient +from tritonclient.utils import np_to_triton_dtype + + +class ScalarIOTest(tu.TestResultCollector): + def setUp(self): + self._client = grpcclient.InferenceServerClient(url="localhost:8001") + self._backends = os.environ.get("BACKENDS", "onnx").split(",") + + def _send_request_and_verify_result(self, input, model_name): + inputs = [] + inputs.append( + grpcclient.InferInput("INPUT", input.shape, np_to_triton_dtype(input.dtype)) + ) + inputs[-1].set_data_from_numpy(input) + result = self._client.infer(inputs=inputs, model_name=model_name) + output = result.as_numpy("OUTPUT") + np.testing.assert_allclose(input, output) + + def test_scalar_io(self): + for backend in self._backends: + model_name = f"{backend}_scalar_1dim" + self._send_request_and_verify_result( + np.asarray([1], dtype=np.float32), model_name + ) + + model_name = f"{backend}_scalar_2dim" + self._send_request_and_verify_result( + np.asarray([[1]], dtype=np.float32), model_name + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_scalar_io/test.sh b/qa/L0_scalar_io/test.sh new file mode 100755 index 0000000000..ebb9a48d95 --- /dev/null +++ b/qa/L0_scalar_io/test.sh @@ -0,0 +1,93 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +RET=0 +TEST_RESULT_FILE='test_results.txt' +BACKENDS="onnx" +export CUDA_VISIBLE_DEVICES=0 +DATADIR=/data/inferenceserver/${REPO_VERSION} + +rm -rf models +mkdir models +cp -r $DATADIR/qa_scalar_models/* models/ + +CLIENT_LOG="./client.log" +SCALAR_TEST=scalar_test.py +source ../common/util.sh + +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=`pwd`/models" +SERVER_LOG="./inference_server.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +python3 $SCALAR_TEST >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** scalar_test.py FAILED. \n***" + cat $CLIENT_LOG + cat $SERVER_LOG + RET=1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +# Make sure the server fails loading the model if it has a dimension higher than +# 1 +sed -i "s/dims.*/dims:\[2\]/g" models/onnx_scalar_1dim/config.pbtxt +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Expected the server to fail loading \n***" + cat $SERVER_LOG + exit 1 +fi + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_sdk/grpc_test.cc b/qa/L0_sdk/grpc_test.cc index 09fe5bbc51..3f45e4ae25 100644 --- a/qa/L0_sdk/grpc_test.cc +++ b/qa/L0_sdk/grpc_test.cc @@ -25,6 +25,7 @@ // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #include + #include "grpc_client.h" namespace tc = triton::client; diff --git a/qa/L0_sdk/http_test.cc b/qa/L0_sdk/http_test.cc index 2c8e231fb2..0b2a4da597 100644 --- a/qa/L0_sdk/http_test.cc +++ b/qa/L0_sdk/http_test.cc @@ -25,6 +25,7 @@ // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #include + #include "http_client.h" namespace tc = triton::client; diff --git a/qa/L0_sdk/test.sh b/qa/L0_sdk/test.sh index 8a52fc05ef..20baf31639 100755 --- a/qa/L0_sdk/test.sh +++ b/qa/L0_sdk/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2019-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -152,7 +152,9 @@ else RET=1 fi -# Check wheels +# Check wheels, note that even TRITON_VERSION is passed as version field for +# wheel generation. The version number will be normalized by setuptools, so +# we need to replace the text here as well to match the normalized version. WHLVERSION=`cat /workspace/TRITON_VERSION | sed 's/dev/\.dev0/'` if [[ "aarch64" != $(uname -m) ]] ; then WHLS="tritonclient-${WHLVERSION}-py3-none-any.whl \ diff --git a/qa/L0_secure_grpc/test.sh b/qa/L0_secure_grpc/test.sh old mode 100644 new mode 100755 index b090258027..784613c6a2 --- a/qa/L0_secure_grpc/test.sh +++ b/qa/L0_secure_grpc/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,6 +42,7 @@ export CUDA_VISIBLE_DEVICES=0 RET=0 +TEST_CLIENT_AIO_PY=../clients/simple_grpc_aio_infer_client.py TEST_CLIENT_PY=../clients/simple_grpc_infer_client.py TEST_CLIENT=../clients/simple_grpc_infer_client @@ -102,6 +103,11 @@ for CASE in server mutual both; do cat ${CLIENT_LOG}.${CASE}.ssl_infer RET=1 fi + $TEST_CLIENT_AIO_PY -v --ssl --root-certificates ca.crt --private-key client.key --certificate-chain client.crt >> ${CLIENT_LOG}.${CASE}.ssl_infer.aio 2>&1 + if [ $? -ne 0 ]; then + cat ${CLIENT_LOG}.${CASE}.ssl_infer.aio + RET=1 + fi $TEST_CLIENT -v --ssl --root-certificates ca.crt --private-key client.key --certificate-chain client.crt >> ${CLIENT_LOG}.${CASE}.c++.ssl_infer 2>&1 if [ $? -ne 0 ]; then @@ -140,6 +146,13 @@ for CASE in server mutual; do else RET=1 fi + $TEST_CLIENT_AIO_PY -v >> ${CLIENT_LOG}.${CASE}.no_ssl_fail_infer.aio 2>&1 + if [ $? -ne 0 ]; then + cat ${CLIENT_LOG}.${CASE}.no_ssl_fail_infer.aio + echo -e "\n***\n*** Expected test failure\n***" + else + RET=1 + fi $TEST_CLIENT -v >> ${CLIENT_LOG}.${CASE}.c++.no_ssl_fail_infer 2>&1 if [ $? -ne 0 ]; then @@ -157,6 +170,13 @@ for CASE in server mutual; do else RET=1 fi + $TEST_CLIENT_AIO_PY -v --ssl --root-certificates ca.crt --private-key client2.key --certificate-chain client2.crt >> ${CLIENT_LOG}.${CASE}.wrong_ssl_fail_infer.aio 2>&1 + if [ $? -ne 0 ]; then + cat ${CLIENT_LOG}.${CASE}.wrong_ssl_fail_infer.aio + echo -e "\n***\n*** Expected test failure\n***" + else + RET=1 + fi $TEST_CLIENT -v --ssl --root-certificates ca.crt --private-key client2.key --certificate-chain client2.crt >> ${CLIENT_LOG}.${CASE}.c++.wrong_ssl_fail_infer 2>&1 if [ $? -ne 0 ]; then diff --git a/qa/L0_sequence_batcher/request_timeout_models/custom_sequence_int32_timeout/config.pbtxt b/qa/L0_sequence_batcher/request_timeout_models/custom_sequence_int32_timeout/config.pbtxt new file mode 100644 index 0000000000..d9be228d5d --- /dev/null +++ b/qa/L0_sequence_batcher/request_timeout_models/custom_sequence_int32_timeout/config.pbtxt @@ -0,0 +1,62 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
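# Context for the wheel-version comment in the L0_sdk change above: setuptools
# normalizes PEP 440 development versions, so a raw TRITON_VERSION such as the
# hypothetical "2.35.0dev" ends up in the generated wheel file name as
# "2.35.0.dev0". The sed expression in test.sh reproduces that normalization;
# a rough Python equivalent (version string illustrative only):
def normalize_dev_version(raw_version: str) -> str:
    # Mirrors `sed 's/dev/\.dev0/'`; assumes the version ends in a bare "dev".
    return raw_version.replace("dev", ".dev0")

print(normalize_dev_version("2.35.0dev"))  # -> 2.35.0.dev0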
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "identity" +max_batch_size: 1 + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] + +sequence_batching { + max_sequence_idle_microseconds: 50000000 +} + +parameters [ + { + key: "execute_delay_ms" + value: { string_value: "5000" } + } +] diff --git a/qa/L0_sequence_batcher/sequence_batcher_test.py b/qa/L0_sequence_batcher/sequence_batcher_test.py old mode 100644 new mode 100755 index 0b794ece9c..3e6cfc032a --- a/qa/L0_sequence_batcher/sequence_batcher_test.py +++ b/qa/L0_sequence_batcher/sequence_batcher_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,22 +30,25 @@ sys.path.append("../common") -from builtins import str import os -import time +import random import threading +import time import unittest +from builtins import str +from functools import partial + import numpy as np -import test_util as tu import sequence_util as su +import test_util as tu +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException -TEST_SYSTEM_SHARED_MEMORY = bool( - int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0))) -TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get('TEST_CUDA_SHARED_MEMORY', - 0))) +TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0))) +TEST_CUDA_SHARED_MEMORY = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0))) -USE_GRPC = (os.environ.get('USE_GRPC', 1) != "0") -USE_HTTP = (os.environ.get('USE_HTTP', 1) != "0") +USE_GRPC = os.environ.get("USE_GRPC", 1) != "0" +USE_HTTP = os.environ.get("USE_HTTP", 1) != "0" assert USE_GRPC or USE_HTTP, "USE_GRPC or USE_HTTP must be non-zero" if USE_GRPC and USE_HTTP: _protocols = ("http", "grpc") @@ -52,27 +57,27 @@ else: _protocols = ("http",) -BACKENDS = os.environ.get('BACKENDS', "graphdef savedmodel onnx plan custom") -ENSEMBLES = bool(int(os.environ.get('ENSEMBLES', 1))) +BACKENDS = os.environ.get("BACKENDS", "graphdef savedmodel onnx plan custom python") +ENSEMBLES = bool(int(os.environ.get("ENSEMBLES", 1))) -NO_BATCHING = (int(os.environ['NO_BATCHING']) == 1) -MODEL_INSTANCES = int(os.environ['MODEL_INSTANCES']) -IMPLICIT_STATE = (int(os.environ['IMPLICIT_STATE']) == 1) +NO_BATCHING = int(os.environ["NO_BATCHING"]) == 1 +MODEL_INSTANCES = int(os.environ["MODEL_INSTANCES"]) +IMPLICIT_STATE = int(os.environ["IMPLICIT_STATE"]) == 1 # Use initial state for implicit state -INITIAL_STATE_FILE = (int(os.environ['INITIAL_STATE_FILE']) == 1) +INITIAL_STATE_FILE = int(os.environ["INITIAL_STATE_FILE"]) == 1 _trials = () if NO_BATCHING: - for backend in BACKENDS.split(' '): - if (backend != "libtorch") and (backend != 'custom'): + for backend in BACKENDS.split(" "): + if backend != "custom": _trials += (backend + "_nobatch",) -elif os.environ['BATCHER_TYPE'] == "VARIABLE": - for backend in BACKENDS.split(' '): - if (backend != "libtorch") and (backend != 'custom'): +elif os.environ["BATCHER_TYPE"] == "VARIABLE": + for backend in BACKENDS.split(" "): + if (backend != "libtorch") and (backend != "custom"): _trials += (backend,) else: - _trials = BACKENDS.split(' ') + _trials = BACKENDS.split(" ") # Add ensemble to the _trials ENSEMBLE_PREFIXES = ["simple_", "sequence_", "fan_"] @@ -94,7 +99,7 @@ # Not all models can be tested for ragged handling because the models # don't deal well with non-size-1 shapes _ragged_batch_not_supported_trials = list() -if os.environ['BATCHER_TYPE'] == "VARIABLE": +if os.environ["BATCHER_TYPE"] == "VARIABLE": if "custom" in _trials: _ragged_batch_not_supported_trials.append("custom") if "plan" in _trials: @@ -115,45 +120,47 @@ def is_ensemble(model_name): class SequenceBatcherTest(su.SequenceBatcherTestUtil): - def get_datatype(self, trial): # Get the datatype to use based on what models are available (see test.sh) - if ("plan" in trial): + if "plan" in trial: return (np.float32,) - if ("custom" in trial): + if "custom" in trial: return (np.int32,) - if ("savedmodel" in trial): + if "savedmodel" in trial: return (np.float32, np.bool_) - if ("graphdef" in trial): + 
if "graphdef" in trial: return (np.dtype(object), np.bool_) - # Only test the string data type for ONNX models in implicit state + # Only test the string data type for ONNX and libtorch models in implicit state if IMPLICIT_STATE: - if ("onnx" in trial): + if "onnx" in trial: return (np.dtype(object), np.int32, np.bool_) + if NO_BATCHING: + if "libtorch" in trial: + return (np.dtype(object), np.int32, np.bool_) return (np.int32, np.bool_) def get_expected_result(self, expected_result, value, trial, flag_str=None): # Adjust the expected_result for models that - # couldn't implement the full accumulator. See + # could not implement the full accumulator. See # qa/common/gen_qa_sequence_models.py for more # information. - if ((not NO_BATCHING and - ("custom" not in trial)) or ("graphdef" in trial) or - ("plan" in trial) or ("onnx" in trial)) or ("libtorch" in trial): + if ( + (not NO_BATCHING and ("custom" not in trial)) + or ("graphdef" in trial) + or ("plan" in trial) + or ("onnx" in trial) + ) or ("libtorch" in trial): expected_result = value if (flag_str is not None) and ("start" in flag_str): expected_result += 1 return expected_result - def get_expected_result_implicit(self, - expected_result, - value, - trial, - flag_str=None, - dtype=None): - if dtype == np.dtype(object): + def get_expected_result_implicit( + self, expected_result, value, trial, flag_str=None, dtype=None + ): + if dtype == np.dtype(object) and trial.startswith("onnx"): return value if INITIAL_STATE_FILE: @@ -176,7 +183,8 @@ def test_simple_sequence(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -185,14 +193,17 @@ def test_simple_sequence(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 45, 9, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 45, 9, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(45, 9, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 45, 9, trial, "end", dtype + ) + ) self.check_sequence( trial, @@ -201,19 +212,28 @@ def test_simple_sequence(self): 5, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - (("start", 1, None, None), (None, 2, None, None), - (None, 3, None, None), (None, 4, None, None), - (None, 5, None, None), (None, 6, None, None), - (None, 7, None, None), (None, 8, None, None), - ("end", 9, None, None)), + ( + ("start", 1, None, None), + (None, 2, None, None), + (None, 3, None, None), + (None, 4, None, None), + (None, 5, None, None), + (None, 6, None, None), + (None, 7, None, None), + (None, 8, None, None), + ("end", 9, None, None), + ), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() - self.check_status(model_name, {1: 9 * (idx + 1)}, - 9 * (idx + 1), 9 * (idx + 1)) + self.check_status( + model_name, {1: 9 * (idx + 1)}, 9 * (idx + 1), 9 * (idx + 1) + ) except Exception as ex: 
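# A worked example of the expected-result adjustment defined above
# (illustrative only): test_simple_sequence sends the values 1..9, so a model
# implementing the full accumulator should return sum(range(1, 10)) == 45, the
# first argument of get_expected_result(). For the trials matched by the
# condition above, which only report the most recent value, the expectation
# collapses to the second argument (9), and one more is added when the flag
# string contains "start", presumably reflecting the START control input being
# folded into that request's value.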
self.assertTrue(False, "unexpected error {}".format(ex)) @@ -229,7 +249,8 @@ def test_length1_sequence(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -238,14 +259,17 @@ def test_length1_sequence(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 42, 42, trial, "start,end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 42, 42, trial, "start,end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(42, 42, trial, "start,end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 42, 42, trial, "start,end", dtype + ) + ) self.check_sequence( trial, @@ -254,16 +278,18 @@ def test_length1_sequence(self): 99, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - ( - ("start,end", 42, None, None),), + (("start,end", 42, None, None),), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() - self.check_status(model_name, {1: idx + 1}, (idx + 1), - (idx + 1)) + self.check_status( + model_name, {1: idx + 1}, (idx + 1), (idx + 1) + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -285,7 +311,8 @@ def test_batch_size(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -294,14 +321,17 @@ def test_batch_size(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 10, 9, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 10, 9, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(10, 9, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 10, 9, trial, "end", dtype + ) + ) self.check_sequence( trial, @@ -315,27 +345,36 @@ def test_batch_size(self): protocol, batch_size=2, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() self.assertTrue(False, "expected error") except Exception as ex: for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request to model '{}' must specify " - + - "batch-size 1 due to requirements of sequence " - + "batcher").format( - model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + 
self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request to model '{}' must specify " + + "batch-size 1 due to requirements of sequence " + + "batcher" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request to model '{}' must specify " - + - "batch-size 1 due to requirements of sequence " - + "batcher").format(model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request to model '{}' must specify " + + "batch-size 1 due to requirements of sequence " + + "batcher" + ).format(model_name) + ) + ) def test_no_correlation_id(self): # Send sequence without correlation ID and check for error. @@ -347,7 +386,8 @@ def test_no_correlation_id(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -356,14 +396,17 @@ def test_no_correlation_id(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 10, 9, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 10, 9, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(10, 9, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 10, 9, trial, "end", dtype + ) + ) self.check_sequence( trial, @@ -376,25 +419,34 @@ def test_no_correlation_id(self): expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() self.assertTrue(False, "expected error") except Exception as ex: for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request to model '{}' must specify a " - + "non-zero or non-empty correlation ID" - ).format(model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request to model '{}' must specify a " + + "non-zero or non-empty correlation ID" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request to model '{}' must specify a " - + - "non-zero or non-empty correlation ID").format( - model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request to model '{}' must specify a " + + "non-zero or non-empty correlation ID" + ).format(model_name) + ) + ) def test_no_sequence_start(self): # Send sequence without start flag for never before seen @@ -407,7 +459,8 @@ def test_no_sequence_start(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -416,15 +469,18 @@ def test_no_sequence_start(self): 
self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) self.check_sequence( trial, model_name, @@ -432,12 +488,17 @@ def test_no_sequence_start(self): 37469245, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - ((None, 1, None, None), (None, 2, None, None), - ("end", 3, None, None)), + ( + (None, 1, None, None), + (None, 2, None, None), + ("end", 3, None, None), + ), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() self.assertTrue(False, "expected error") @@ -445,20 +506,27 @@ def test_no_sequence_start(self): print(model_name + "-> " + ex.message()) for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request for sequence 37469245 to " - + - "model '{}' must specify the START flag on the first " - + "request of the sequence").format( - model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request for sequence 37469245 to " + + "model '{}' must specify the START flag on the first " + + "request of the sequence" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request for sequence 37469245 to " + - "model '{}' must specify the START flag on the first " - + - "request of the sequence").format(model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request for sequence 37469245 to " + + "model '{}' must specify the START flag on the first " + + "request of the sequence" + ).format(model_name) + ) + ) def test_no_sequence_start2(self): # Send sequence without start flag after sending a valid @@ -472,7 +540,8 @@ def test_no_sequence_start2(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -481,14 +550,17 @@ def test_no_sequence_start2(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 6, 3, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, None, dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(6, 3, trial, None) + if not IMPLICIT_STATE + else 
self.get_expected_result_implicit( + 6, 3, trial, None, dtype + ) + ) self.check_sequence( trial, @@ -497,34 +569,48 @@ def test_no_sequence_start2(self): 3, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - (("start", 1, None, None), (None, 2, None, None), - ("end", 3, None, None), (None, 55, None, None)), + ( + ("start", 1, None, None), + (None, 2, None, None), + ("end", 3, None, None), + (None, 55, None, None), + ), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) - self.check_status(model_name, {1: 3 * (idx + 1)}, - 3 * (idx + 1), 3 * (idx + 1)) + self.check_status( + model_name, {1: 3 * (idx + 1)}, 3 * (idx + 1), 3 * (idx + 1) + ) self.check_deferred_exception() self.assertTrue(False, "expected error") except Exception as ex: for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request for sequence 3 to model '{}' must " - + - "specify the START flag on the first request of " - + "the sequence").format( - model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request for sequence 3 to model '{}' must " + + "specify the START flag on the first request of " + + "the sequence" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request for sequence 3 to model '{}' must " - + - "specify the START flag on the first request of " - + "the sequence").format(model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request for sequence 3 to model '{}' must " + + "specify the START flag on the first request of " + + "the sequence" + ).format(model_name) + ) + ) def test_no_sequence_end(self): # Send sequence without end flag. 
Use same correlation ID to @@ -538,7 +624,8 @@ def test_no_sequence_end(self): model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( - dtype == np.bool_): + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -547,14 +634,17 @@ def test_no_sequence_end(self): self.clear_deferred_exceptions() try: self.check_setup(model_name) - self.assertFalse( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" - in os.environ) - expected_result = self.get_expected_result( - 51, 9, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 51, 9, trial, "end", dtype) + self.assertNotIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertNotIn( + "TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ + ) + expected_result = ( + self.get_expected_result(51, 9, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 51, 9, trial, "end", dtype + ) + ) self.check_sequence( trial, @@ -563,16 +653,23 @@ def test_no_sequence_end(self): 4566, (4000, None), # (flag_str, value, (ls_ms, gt_ms), (pre_delay, post_delay)) - (("start", 1, None, None), (None, 2, None, None), - ("start", 42, None, None), ("end", 9, None, None)), + ( + ("start", 1, None, None), + (None, 2, None, None), + ("start", 42, None, None), + ("end", 9, None, None), + ), expected_result, protocol, sequence_name="{}_{}".format( - self._testMethodName, protocol)) + self._testMethodName, protocol + ), + ) self.check_deferred_exception() - self.check_status(model_name, {1: 4 * (idx + 1)}, - 4 * (idx + 1), 4 * (idx + 1)) + self.check_status( + model_name, {1: 4 * (idx + 1)}, 4 * (idx + 1), 4 * (idx + 1) + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -586,8 +683,9 @@ def test_half_batch(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -596,29 +694,31 @@ def test_half_batch(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3, 4), dtype, 0) + (1, 2, 3, 4), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (0, 9, 5, 13), dtype, 1) + (0, 9, 5, 13), dtype, 1 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 8) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 8) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) - expected_result = self.get_expected_result( - 10, 4, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 10, 4, trial, "end", dtype) + expected_result = ( + self.get_expected_result(10, 4, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 10, 4, trial, "end", dtype + ) + ) threads = [] threads.append( @@ -631,18 +731,25 @@ def test_half_batch(self): 987, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), - (None, 3, None), ("end", 4, None)), + ( + ("start", 1, None), + (None, 2, None), + (None, 3, None), + ("end", 4, None), + ), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 27, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 27, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(27, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 27, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -653,14 +760,18 @@ def test_half_batch(self): 988, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 0, None), (None, 9, None), - (None, 5, None), ("end", 13, None)), + ( + ("start", 0, None), + (None, 9, None), + (None, 5, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -676,7 +787,9 @@ def test_half_batch(self): self.check_status( model_name, {stats_batch_size: 4 * min(2, MODEL_INSTANCES)}, - exec_cnt, 8) + exec_cnt, + 8, + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) finally: @@ -694,8 +807,9 @@ def test_skip_batch(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -704,34 +818,40 @@ def test_skip_batch(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 3), dtype, 0) + (1, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13, 14), dtype, 1) + (11, 12, 13, 14), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 113), dtype, 2) + (111, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113, 1114), dtype, 3) + (1111, 1112, 1113, 1114), 
dtype, 3 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. - self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 4, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(4, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 4, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -744,15 +864,18 @@ def test_skip_batch(self): # (flag_str, value, pre_delay_ms) (("start", 1, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 50, 14, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 50, 14, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(50, 14, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 50, 14, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -763,18 +886,25 @@ def test_skip_batch(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - (None, 13, None), ("end", 14, None)), + ( + ("start", 11, None), + (None, 12, None), + (None, 13, None), + ("end", 14, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 224, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 224, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(224, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 224, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -787,15 +917,18 @@ def test_skip_batch(self): # (flag_str, value, pre_delay_ms) (("start", 111, None), ("end", 113, None)), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 4450, 1114, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4450, 1114, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(4450, 1114, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 4450, 1114, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -806,14 +939,18 @@ 
def test_skip_batch(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - (None, 1113, None), ("end", 1114, None)), + ( + ("start", 1111, None), + (None, 1112, None), + (None, 1113, None), + ("end", 1114, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[1].start() threads[3].start() @@ -858,8 +995,9 @@ def test_full_batch(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -868,33 +1006,39 @@ def test_full_batch(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1) + (11, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2) + (111, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. - self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) - - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) + + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads = [] threads.append( threading.Thread( @@ -906,19 +1050,21 @@ def test_full_batch(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - - expected_result = self.get_expected_result( - 36, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + + expected_result = ( + self.get_expected_result(36, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -929,19 +1075,25 @@ def test_full_batch(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 
13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -952,18 +1104,24 @@ def test_full_batch(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -974,14 +1132,17 @@ def test_full_batch(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -992,9 +1153,12 @@ def test_full_batch(self): # Requests do not get batched for the ensemble model self.check_status(model_name, {1: 12}, 12, 12) else: - self.check_status(model_name, { - (4 / MODEL_INSTANCES): (3 * MODEL_INSTANCES) - }, 3 * MODEL_INSTANCES, 12) + self.check_status( + model_name, + {(4 / MODEL_INSTANCES): (3 * MODEL_INSTANCES)}, + 3 * MODEL_INSTANCES, + 12, + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) finally: @@ -1021,8 +1185,9 @@ def test_ragged_batch(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1031,34 +1196,40 @@ def test_ragged_batch(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0, tensor_shape=(2,)) + (1, 2, 3), dtype, 0, tensor_shape=(2,) + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1, tensor_shape=(2,)) + (11, 12, 13), dtype, 1, tensor_shape=(2,) + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2, tensor_shape=(1,)) + (111, 112, 113), dtype, 2, tensor_shape=(1,) + ) 
precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3, tensor_shape=(3,)) + (1111, 1112, 1113), dtype, 3, tensor_shape=(3,) + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. - self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 6 * 2, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6 * 2, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1069,20 +1240,24 @@ def test_ragged_batch(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), + precreated_shm0_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (2,) - })) - - expected_result = self.get_expected_result( - 36 * 2, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36, 13, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (2,), + }, + ) + ) + + expected_result = ( + self.get_expected_result(36 * 2, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1093,19 +1268,27 @@ def test_ragged_batch(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), + precreated_shm1_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (2,) - })) - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (2,), + }, + ) + ) + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1116,19 +1299,27 @@ def test_ragged_batch(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), + precreated_shm2_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (1,) - })) - expected_result = self.get_expected_result( - 
3336 * 3, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (1,), + }, + ) + ) + expected_result = ( + self.get_expected_result(3336 * 3, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1139,15 +1330,20 @@ def test_ragged_batch(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), + precreated_shm3_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (3,) - })) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (3,), + }, + ) + ) threads[0].start() threads[1].start() @@ -1188,8 +1384,9 @@ def test_ragged_batch_allowed(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1198,34 +1395,40 @@ def test_ragged_batch_allowed(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0, tensor_shape=(2,)) + (1, 2, 3), dtype, 0, tensor_shape=(2,) + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1, tensor_shape=(2,)) + (11, 12, 13), dtype, 1, tensor_shape=(2,) + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2, tensor_shape=(1,)) + (111, 112, 113), dtype, 2, tensor_shape=(1,) + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3, tensor_shape=(3,)) + (1111, 1112, 1113), dtype, 3, tensor_shape=(3,) + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
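A large share of the remaining hunks are the same mechanical rewrap of the expected_result conditional expression; the call contents are unchanged, only the wrapping moves. A before/after sketch with simplified stand-in helpers (the trial and dtype strings below are placeholders, not values taken from the test matrix):

IMPLICIT_STATE = False


def get_expected_result(total, last_value, trial, flags):
    # Stand-in for the real helper; just returns its inputs for demonstration.
    return (total, last_value, trial, flags)


def get_expected_result_implicit(total, last_value, trial, flags, dtype):
    return (total, last_value, trial, flags, dtype)


# Pre-patch wrapping: the ternary is split across the two call continuations.
expected_result = get_expected_result(
    6, 3, "some_trial", "end"
) if not IMPLICIT_STATE else get_expected_result_implicit(
    6, 3, "some_trial", "end", "int32")

# Post-patch wrapping: the whole conditional expression is parenthesized so
# the condition and each branch sit on their own lines.
expected_result = (
    get_expected_result(6, 3, "some_trial", "end")
    if not IMPLICIT_STATE
    else get_expected_result_implicit(6, 3, "some_trial", "end", "int32")
)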
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 6 * 2, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6 * 2, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6 * 2, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6 * 2, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1236,20 +1439,24 @@ def test_ragged_batch_allowed(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), + precreated_shm0_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (2,) - })) - - expected_result = self.get_expected_result( - 36 * 2, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36 * 2, 13, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (2,), + }, + ) + ) + + expected_result = ( + self.get_expected_result(36 * 2, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36 * 2, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1260,19 +1467,27 @@ def test_ragged_batch_allowed(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), + precreated_shm1_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (2,) - })) - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (2,), + }, + ) + ) + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1283,19 +1498,27 @@ def test_ragged_batch_allowed(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), + precreated_shm2_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (1,) - })) - expected_result = self.get_expected_result( - 3336 * 3, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336 * 3, 1113, trial, "end", dtype) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (1,), + }, + ) + ) + expected_result 
= ( + self.get_expected_result(3336 * 3, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336 * 3, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1306,15 +1529,20 @@ def test_ragged_batch_allowed(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), + precreated_shm3_handles, + ), kwargs={ - 'sequence_name': - "{}".format(self._testMethodName), - 'tensor_shape': (3,) - })) + "sequence_name": "{}".format(self._testMethodName), + "tensor_shape": (3,), + }, + ) + ) for t in threads: t.start() @@ -1346,8 +1574,9 @@ def test_backlog(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1356,36 +1585,43 @@ def test_backlog(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1) + (11, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2) + (111, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111, 11112, 11113), dtype, 4) + (11111, 11112, 11113), dtype, 4 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1396,18 +1632,20 @@ def test_backlog(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 36, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(36, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1418,18 +1656,24 @@ def test_backlog(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1440,18 +1684,24 @@ def test_backlog(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( 
target=self.check_sequence_async, @@ -1462,19 +1712,25 @@ def test_backlog(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - - expected_result = self.get_expected_result( - 33336, 11113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 33336, 11113, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + + expected_result = ( + self.get_expected_result(33336, 11113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 33336, 11113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1485,14 +1741,17 @@ def test_backlog(self): 1005, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11111, None), (None, 11112, None), - ("end", 11113, None)), + ( + ("start", 11111, None), + (None, 11112, None), + ("end", 11113, None), + ), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -1537,8 +1796,9 @@ def test_backlog_fill(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1547,38 +1807,46 @@ def test_backlog_fill(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 13), dtype, 1) + (11, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 113), dtype, 2) + (111, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111,), dtype, 4) + (11111,), dtype, 4 + ) precreated_shm5_handles = self.precreate_register_regions( - (22222,), dtype, 5) + (22222,), dtype, 5 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
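Every multi-sequence test touched by these hunks follows the same concurrency skeleton: build a list of threads targeting check_sequence_async, one per sequence, then start and join them and surface any deferred exception afterwards. A condensed sketch with a no-op stand-in for the helper (argument names and values are illustrative only, copied loosely from the hunks above):

import threading


def check_sequence_async(trial, model_name, dtype, correlation_id, timeouts,
                         steps, expected_result, shm_handles, sequence_name=None):
    # No-op stand-in for the real helper, which sends one inference per
    # (flag_str, value, pre_delay_ms) step and records failures for the
    # later check_deferred_exception() call.
    pass


threads = []
for correlation_id, steps, expected in (
    (1001, (("start", 1, None), (None, 2, None), ("end", 3, None)), 6),
    (1002, (("start", 11, None), (None, 12, None), ("end", 13, None)), 36),
):
    threads.append(
        threading.Thread(
            target=check_sequence_async,
            args=("some_trial", "some_model", "int32", correlation_id,
                  (None, None), steps, expected, None),
            kwargs={"sequence_name": "example_sequence"},
        )
    )

for t in threads:
    t.start()
for t in threads:
    t.join()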
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 10) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 10 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 2) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 2 + ) threads = [] - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1589,18 +1857,20 @@ def test_backlog_fill(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 24, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 24, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(24, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 24, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1613,15 +1883,18 @@ def test_backlog_fill(self): # (flag_str, value, pre_delay_ms) (("start", 11, None), ("end", 13, None)), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 224, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 224, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(224, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 224, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1634,15 +1907,18 @@ def test_backlog_fill(self): # (flag_str, value, pre_delay_ms) (("start", 111, None), ("end", 113, None)), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1653,18 +1929,24 @@ def test_backlog_fill(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 
1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 11111, 11111, trial, "start,end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 11111, 11111, trial, "start,end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(11111, 11111, trial, "start,end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 11111, 11111, trial, "start,end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1675,18 +1957,20 @@ def test_backlog_fill(self): 1005, (None, None), # (flag_str, value, pre_delay_ms) - ( - ("start,end", 11111, None),), + (("start,end", 11111, None),), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 22222, 22222, trial, "start,end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 22222, 22222, trial, "start,end", dtype) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(22222, 22222, trial, "start,end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 22222, 22222, trial, "start,end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1697,14 +1981,13 @@ def test_backlog_fill(self): 1006, (None, None), # (flag_str, value, pre_delay_ms) - ( - ("start,end", 22222, None),), + (("start,end", 22222, None),), expected_result, - precreated_shm5_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm5_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() threads[1].start() @@ -1751,8 +2034,9 @@ def test_backlog_fill_no_end(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1761,38 +2045,46 @@ def test_backlog_fill_no_end(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 13), dtype, 1) + (11, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 113), dtype, 2) + (111, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111,), dtype, 4) + (11111,), dtype, 4 + ) precreated_shm5_handles = self.precreate_register_regions( - (22222, 22223, 22224), dtype, 5) + (22222, 22223, 22224), dtype, 5 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
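The backlog-fill hunks above also contain sequences made of a single request carrying both control flags; the step list is a one-element tuple, and the reformat simply moves the literal onto one line. A tiny sketch of how such a step reads (values copied from the hunk, interpreted per the "(flag_str, value, pre_delay_ms)" comment):

# One-element step tuple: the trailing comma is what makes it a tuple.
single_step_sequence = (("start,end", 11111, None),)

assert isinstance(single_step_sequence, tuple) and len(single_step_sequence) == 1

flag_str, value, pre_delay_ms = single_step_sequence[0]
assert flag_str == "start,end"
assert value == 11111
assert pre_delay_ms is None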
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 10) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 10 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 3) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 3 + ) threads = [] - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1803,18 +2095,20 @@ def test_backlog_fill_no_end(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 24, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 24, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(24, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 24, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1827,15 +2121,18 @@ def test_backlog_fill_no_end(self): # (flag_str, value, pre_delay_ms) (("start", 11, None), ("end", 13, None)), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 224, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 224, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(224, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 224, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1848,15 +2145,18 @@ def test_backlog_fill_no_end(self): # (flag_str, value, pre_delay_ms) (("start", 111, None), ("end", 113, None)), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1867,18 +2167,24 @@ def test_backlog_fill_no_end(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 
1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 11111, 11111, trial, "start,end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 11111, 11111, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(11111, 11111, trial, "start,end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 11111, 11111, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1889,18 +2195,20 @@ def test_backlog_fill_no_end(self): 1005, (None, None), # (flag_str, value, pre_delay_ms) - ( - ("start,end", 11111, None),), + (("start,end", 11111, None),), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 66669, 22224, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 66669, 22224, trial, "end", dtype) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(66669, 22224, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 66669, 22224, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -1917,11 +2225,11 @@ def test_backlog_fill_no_end(self): ("end", 22224, 2000), ), expected_result, - precreated_shm5_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm5_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() time.sleep(2) @@ -1967,8 +2275,9 @@ def test_backlog_same_correlation_id(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -1977,36 +2286,43 @@ def test_backlog_same_correlation_id(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 2, 3), dtype, 0) + (1, 2, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13), dtype, 1) + (11, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 113), dtype, 2) + (111, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113), dtype, 3) + (1111, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111, 11113), dtype, 4) + (11111, 11113), dtype, 4 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
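The check_status call near the end of test_backlog_same_correlation_id, reformatted in the hunks that follow, encodes the expected batching arithmetic. A small worked example of those counts (the per-bucket interpretation in the comments is an inference from the test structure, not stated in the patch):

for MODEL_INSTANCES in (1, 2, 4):
    if MODEL_INSTANCES != 4:
        batch_exec = {
            (4 / MODEL_INSTANCES): (3 * MODEL_INSTANCES),  # batched sequence requests
            1: 2,  # presumably the two backlog requests, executed one at a time
        }
    else:
        batch_exec = {1: (3 * MODEL_INSTANCES) + 2}
    exec_cnt = (3 * MODEL_INSTANCES) + 2
    infer_cnt = 4 * 3 + 2  # four 3-step sequences plus the 2-step backlog sequence = 14
    print(MODEL_INSTANCES, batch_exec, exec_cnt, infer_cnt)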
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 2) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 2 + ) threads = [] - expected_result = self.get_expected_result( - 6, 3, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 6, 3, trial, "end", dtype) + expected_result = ( + self.get_expected_result(6, 3, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 6, 3, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2017,18 +2333,20 @@ def test_backlog_same_correlation_id(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), (None, 2, None), ("end", 3, - None)), + (("start", 1, None), (None, 2, None), ("end", 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 36, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 36, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(36, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 36, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2039,18 +2357,24 @@ def test_backlog_same_correlation_id(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 336, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 336, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(336, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 336, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2061,18 +2385,24 @@ def test_backlog_same_correlation_id(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 3336, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 3336, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(3336, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 3336, 1113, trial, 
"end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2083,18 +2413,24 @@ def test_backlog_same_correlation_id(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 22224, 11113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 22224, 11113, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(22224, 11113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 22224, 11113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2107,11 +2443,11 @@ def test_backlog_same_correlation_id(self): # (flag_str, value, pre_delay_ms) (("start", 11111, None), ("end", 11113, None)), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() threads[1].start() @@ -2129,12 +2465,13 @@ def test_backlog_same_correlation_id(self): if MODEL_INSTANCES != 4: batch_exec = { (4 / MODEL_INSTANCES): (3 * MODEL_INSTANCES), - 1: 2 + 1: 2, } else: batch_exec = {1: (3 * MODEL_INSTANCES) + 2} - self.check_status(model_name, batch_exec, - (3 * MODEL_INSTANCES) + 2, 14) + self.check_status( + model_name, batch_exec, (3 * MODEL_INSTANCES) + 2, 14 + ) except Exception as ex: self.assertTrue(False, "unexpected error {}".format(ex)) finally: @@ -2166,8 +2503,9 @@ def test_backlog_same_correlation_id_no_end(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -2176,35 +2514,40 @@ def test_backlog_same_correlation_id_no_end(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 3), dtype, 0) + (1, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 12, 13), dtype, 1) + (11, 12, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 112, 113), dtype, 2) + (111, 112, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1112, 1113), dtype, 3) + (1111, 1112, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111, 11113), dtype, 4) + (11111, 11113), dtype, 4 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for both sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 16) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 16 + ) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 4, 3, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4, 3, trial, None, dtype) + expected_result = ( + self.get_expected_result(4, 3, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit(4, 3, trial, None, dtype) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2217,15 +2560,18 @@ def test_backlog_same_correlation_id_no_end(self): # (flag_str, value, pre_delay_ms) (("start", 1, None), (None, 3, None)), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 48, 13, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 48, 13, trial, "end", dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(48, 13, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 48, 13, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2236,18 +2582,25 @@ def test_backlog_same_correlation_id_no_end(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - (None, 12, None), ("end", 13, None)), + ( + ("start", 11, None), + (None, 12, None), + (None, 12, None), + ("end", 13, None), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 448, 113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 448, 113, trial, "end", dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(448, 113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 448, 113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2258,18 +2611,25 @@ def test_backlog_same_correlation_id_no_end(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, None), (None, 112, None), - (None, 112, None), ("end", 113, None)), + ( + ("start", 111, None), + (None, 112, None), + (None, 112, None), + ("end", 113, None), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 4448, 1113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4448, 1113, trial, "end", dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(4448, 1113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 4448, 1113, trial, "end", dtype + ) 
+ ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2280,18 +2640,25 @@ def test_backlog_same_correlation_id_no_end(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - (None, 1112, None), ("end", 1113, None)), + ( + ("start", 1111, None), + (None, 1112, None), + (None, 1112, None), + ("end", 1113, None), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 22224, 11113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 22224, 11113, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(22224, 11113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 22224, 11113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2304,11 +2671,11 @@ def test_backlog_same_correlation_id_no_end(self): # (flag_str, value, pre_delay_ms) (("start", 11111, None), ("end", 11113, None)), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() threads[1].start() @@ -2355,8 +2722,9 @@ def test_backlog_sequence_timeout(self): for dtype in dtypes: model_name = tu.get_sequence_model_name(trial, dtype) # Skip bool type ensemble models - if (any(word in trial - for word in ENSEMBLE_PREFIXES)) and (dtype == np.bool_): + if (any(word in trial for word in ENSEMBLE_PREFIXES)) and ( + dtype == np.bool_ + ): continue # For bool type control models, use int32 as I/O types if dtype == np.bool_: @@ -2365,35 +2733,38 @@ def test_backlog_sequence_timeout(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1, 3), dtype, 0) + (1, 3), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 12, 13), dtype, 1) + (11, 12, 12, 13), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 112, 112, 113), dtype, 2) + (111, 112, 112, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1112, 1113), dtype, 3) + (1111, 1112, 1112, 1113), dtype, 3 + ) precreated_shm4_handles = self.precreate_register_regions( - (11111, 11113), dtype, 4) + (11111, 11113), dtype, 4 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain all # inferences for all sequences. 
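In test_backlog_sequence_timeout, whose hunks follow, the step tuples carry non-trivial pre-delays so that the first sequence goes idle past the scheduler's limit while the others stay just under it. A small interpretive sketch of one such step list (the idle-limit value below is an assumption for illustration only; the real constant is defined elsewhere in the file):

_max_sequence_idle_ms = 5000  # assumed value for illustration only

# Sequence 1001: the second request deliberately sleeps past the idle limit,
# so the scheduler should have timed the sequence out before it arrives.
steps = (
    ("start", 1, None),
    (None, 3, _max_sequence_idle_ms + 1000),
)

for flag_str, value, pre_delay_ms in steps:
    delay = 0 if pre_delay_ms is None else pre_delay_ms
    print("flags={}, value={}, sleep {} ms before sending".format(flag_str, value, delay))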
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 4) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 4, 3, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 4, 3, trial, None, dtype) + expected_result = ( + self.get_expected_result(4, 3, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit(4, 3, trial, None, dtype) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2404,18 +2775,23 @@ def test_backlog_sequence_timeout(self): 1001, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1, None), - (None, 3, _max_sequence_idle_ms + 1000)), + ( + ("start", 1, None), + (None, 3, _max_sequence_idle_ms + 1000), + ), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 48, 13, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 48, 13, trial, None, dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(48, 13, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 48, 13, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2426,20 +2802,25 @@ def test_backlog_sequence_timeout(self): 1002, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, - None), (None, 12, _max_sequence_idle_ms / 2), - (None, 12, _max_sequence_idle_ms / 2), - ("end", 13, _max_sequence_idle_ms / 2)), + ( + ("start", 11, None), + (None, 12, _max_sequence_idle_ms / 2), + (None, 12, _max_sequence_idle_ms / 2), + ("end", 13, _max_sequence_idle_ms / 2), + ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 448, 113, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 448, 113, trial, None, dtype) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(448, 113, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 448, 113, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2450,20 +2831,25 @@ def test_backlog_sequence_timeout(self): 1003, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 111, - None), (None, 112, _max_sequence_idle_ms / 2), - (None, 112, _max_sequence_idle_ms / 2), - ("end", 113, _max_sequence_idle_ms / 2)), + ( + ("start", 111, None), + (None, 112, _max_sequence_idle_ms / 2), + (None, 112, _max_sequence_idle_ms / 2), + ("end", 113, _max_sequence_idle_ms / 2), + ), expected_result, - precreated_shm2_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 4448, 1113, trial, None - ) if not IMPLICIT_STATE 
else self.get_expected_result_implicit( - 4448, 1113, trial, None, dtype) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(4448, 1113, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 4448, 1113, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2474,20 +2860,25 @@ def test_backlog_sequence_timeout(self): 1004, (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), - (None, 1112, _max_sequence_idle_ms / 2), - (None, 1112, _max_sequence_idle_ms / 2), - ("end", 1113, _max_sequence_idle_ms / 2)), + ( + ("start", 1111, None), + (None, 1112, _max_sequence_idle_ms / 2), + (None, 1112, _max_sequence_idle_ms / 2), + ("end", 1113, _max_sequence_idle_ms / 2), + ), expected_result, - precreated_shm3_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 22224, 11113, trial, "end" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 22224, 11113, trial, "end", dtype) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(22224, 11113, trial, "end") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 22224, 11113, trial, "end", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2500,11 +2891,11 @@ def test_backlog_sequence_timeout(self): # (flag_str, value, pre_delay_ms) (("start", 11111, None), ("end", 11113, None)), expected_result, - precreated_shm4_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm4_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() threads[1].start() @@ -2520,18 +2911,27 @@ def test_backlog_sequence_timeout(self): except Exception as ex: for prefix in ENSEMBLE_PREFIXES: if model_name.startswith(prefix): - base_model_name = model_name[(len(prefix)):] - self.assertTrue(ex.message().startswith( - str("in ensemble '{}', " + - "inference request for sequence 1001 to " + - "model '{}' must specify the START flag on the first " - + "request of the sequence").format( - model_name, base_model_name))) + base_model_name = model_name[(len(prefix)) :] + self.assertTrue( + ex.message().startswith( + str( + "in ensemble '{}', " + + "inference request for sequence 1001 to " + + "model '{}' must specify the START flag on the first " + + "request of the sequence" + ).format(model_name, base_model_name) + ) + ) return - self.assertTrue(ex.message().startswith( - str("inference request for sequence 1001 to " + - "model '{}' must specify the START flag on the first " - + "request of the sequence").format(model_name))) + self.assertTrue( + ex.message().startswith( + str( + "inference request for sequence 1001 to " + + "model '{}' must specify the START flag on the first " + + "request of the sequence" + ).format(model_name) + ) + ) finally: if TEST_SYSTEM_SHARED_MEMORY or TEST_CUDA_SHARED_MEMORY: self.cleanup_shm_regions(precreated_shm0_handles) @@ -2567,28 +2967,30 @@ def test_queue_delay_no_min_util(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1,), dtype, 0) + (1,), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12), dtype, 1) + (11, 12), dtype, 1 + ) try: 
self.check_setup(model_name) # Need scheduler to wait for queue to contain 2 sequences. - self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 1, 1, trial, "start" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 1, 1, trial, "start", dtype) + expected_result = ( + self.get_expected_result(1, 1, trial, "start") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 1, 1, trial, "start", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2599,18 +3001,20 @@ def test_queue_delay_no_min_util(self): 1001, (2000, None), # (flag_str, value, pre_delay_ms) - ( - ("start", 1, None),), + (("start", 1, None),), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 23, 12, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 23, 12, trial, None, dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(23, 12, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 23, 12, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2626,11 +3030,11 @@ def test_queue_delay_no_min_util(self): (None, 12, None), ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() time.sleep(1) @@ -2674,28 +3078,30 @@ def test_queue_delay_half_min_util(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1,), dtype, 0) + (1,), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12), dtype, 1) + (11, 12), dtype, 1 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain 2 sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 1, 1, trial, "start" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 1, 1, trial, "start", dtype) + expected_result = ( + self.get_expected_result(1, 1, trial, "start") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 1, 1, trial, "start", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2706,18 +3112,20 @@ def test_queue_delay_half_min_util(self): 1001, (2000, None), # (flag_str, value, pre_delay_ms) - ( - ("start", 1, None),), + (("start", 1, None),), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 23, 12, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 23, 12, trial, None, dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(23, 12, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 23, 12, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2733,11 +3141,11 @@ def test_queue_delay_half_min_util(self): (None, 12, None), ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() time.sleep(1) @@ -2781,28 +3189,30 @@ def test_queue_delay_full_min_util(self): self.clear_deferred_exceptions() precreated_shm0_handles = self.precreate_register_regions( - (1,), dtype, 0) + (1,), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_regions( - (11, 12), dtype, 1) + (11, 12), dtype, 1 + ) try: self.check_setup(model_name) # Need scheduler to wait for queue to contain 2 sequences. 
- self.assertTrue( - "TRITONSERVER_DELAY_SCHEDULER" in os.environ) + self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 2) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), - 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) threads = [] - expected_result = self.get_expected_result( - 1, 1, trial, "start" - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 1, 1, trial, "start", dtype) + expected_result = ( + self.get_expected_result(1, 1, trial, "start") + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 1, 1, trial, "start", dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2813,18 +3223,20 @@ def test_queue_delay_full_min_util(self): 1001, (4000, 3000), # (flag_str, value, pre_delay_ms) - ( - ("start", 1, None),), + (("start", 1, None),), expected_result, - precreated_shm0_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) - expected_result = self.get_expected_result( - 23, 12, trial, None - ) if not IMPLICIT_STATE else self.get_expected_result_implicit( - 23, 12, trial, None, dtype) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) + expected_result = ( + self.get_expected_result(23, 12, trial, None) + if not IMPLICIT_STATE + else self.get_expected_result_implicit( + 23, 12, trial, None, dtype + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -2840,11 +3252,11 @@ def test_queue_delay_full_min_util(self): (None, 12, 2000), ), expected_result, - precreated_shm1_handles), - kwargs={ - 'sequence_name': - "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[0].start() time.sleep(1) @@ -2862,5 +3274,345 @@ def test_queue_delay_full_min_util(self): self.cleanup_shm_regions(precreated_shm1_handles) -if __name__ == '__main__': +class SequenceBatcherRequestTimeoutTest(su.SequenceBatcherTestUtil): + def setUp(self): + super(SequenceBatcherRequestTimeoutTest, self).setUp() + # By default, find tritonserver on "localhost", but can be overridden + # with TRITONSERVER_IPADDR envvar + self.server_address_ = ( + os.environ.get("TRITONSERVER_IPADDR", "localhost") + ":8001" + ) + + # Prepare input and expected output based on the model and + # the infer sequence sent for testing. 
If the test is to be extended + # for different sequence and model, then proper grouping should be added + self.model_name_ = "custom_sequence_int32_timeout" + self.tensor_data_ = np.ones(shape=[1, 1], dtype=np.int32) + self.inputs_ = [grpcclient.InferInput("INPUT0", [1, 1], "INT32")] + self.inputs_[0].set_data_from_numpy(self.tensor_data_) + self.expected_out_seq_ = [ + ("OUTPUT0", self.tensor_data_), + ("OUTPUT0", self.tensor_data_), + ("OUTPUT0", self.tensor_data_), + ] + + def send_sequence_with_timeout( + self, seq_id, callback, timeout_us=3000000, request_pause_sec=0 + ): + with grpcclient.InferenceServerClient(self.server_address_) as triton_client: + triton_client.start_stream(callback=callback) + triton_client.async_stream_infer( + self.model_name_, + self.inputs_, + sequence_id=seq_id, + sequence_start=True, + timeout=timeout_us, + ) + if request_pause_sec != 0: + time.sleep(request_pause_sec) + triton_client.async_stream_infer( + self.model_name_, self.inputs_, sequence_id=seq_id, timeout=timeout_us + ) + if request_pause_sec != 0: + time.sleep(request_pause_sec) + triton_client.async_stream_infer( + self.model_name_, + self.inputs_, + sequence_id=seq_id, + sequence_end=True, + timeout=timeout_us, + ) + + def test_request_timeout(self): + # Test long running model that receives requests with shorter timeout, + # expect the timeout will only be expired on backlog sequence and reject + # all requests of the sequence once expired. + # Sending two sequences while the model can only process one sequence + # at a time. Each model execution takes 5 second and all requests have + # 3 second timeout, so the second sequence will be rejected. + + # correlation ID is 1-index + seq1_res = [] + seq2_res = [] + seq1_callback = lambda result, error: seq1_res.append((result, error)) + seq2_callback = lambda result, error: seq2_res.append((result, error)) + + # send sequence with 1s interval to ensure processing order + threads = [] + threads.append( + threading.Thread( + target=self.send_sequence_with_timeout, args=(1, seq1_callback) + ) + ) + threads.append( + threading.Thread( + target=self.send_sequence_with_timeout, args=(2, seq2_callback) + ) + ) + threads[0].start() + time.sleep(1) + threads[1].start() + for t in threads: + t.join() + + for idx in range(len(seq1_res)): + result, error = seq1_res[idx] + self.assertIsNone( + error, + "Expect successful inference for sequence 1 requests, got error: {}".format( + error + ), + ) + out = result.as_numpy(self.expected_out_seq_[idx][0]) + expected_out = self.expected_out_seq_[idx][1] + np.testing.assert_allclose( + out, + expected_out, + err_msg="Unexpected output tensor: expect {}, got {}".format( + expected_out, out + ), + ) + + for _, error in seq2_res: + self.assertIsNotNone(error, "Expect error for sequence 2 requests") + with self.assertRaisesRegex( + InferenceServerException, + "timeout of the corresponding sequence has been expired", + msg="Unexpected error: {}".format(error), + ): + raise error + + def test_send_request_after_timeout(self): + # Similar to test_request_timeout, but the sequence to be timed out + # will send the last request after the sequence has been timed out, + # and expecting server to return error regarding sending request of + # an untracked sequence + + seq1_res = [] + seq2_res = [] + seq1_callback = lambda result, error: seq1_res.append((result, error)) + seq2_callback = lambda result, error: seq2_res.append((result, error)) + + threads = [] + threads.append( + threading.Thread( + 
target=self.send_sequence_with_timeout, args=(1, seq1_callback) + ) + ) + # Each request will be sent with a pause, so the third request + # will be sent after the sequence has been timed out + threads.append( + threading.Thread( + target=self.send_sequence_with_timeout, + args=(2, seq2_callback), + kwargs={"request_pause_sec": 2}, + ) + ) + threads[0].start() + time.sleep(1) + threads[1].start() + for t in threads: + t.join() + + # Check error message of the last request and the rest + # separately + for _, error in seq2_res[0:-1]: + self.assertIsNotNone(error, "Expect error for sequence 2 requests") + with self.assertRaisesRegex( + InferenceServerException, + "timeout of the corresponding sequence has been expired", + msg="Unexpected error: {}".format(error), + ): + raise error + _, last_err = seq2_res[-1] + self.assertIsNotNone(last_err, "Expect error for sequence 2 requests") + with self.assertRaisesRegex( + InferenceServerException, + "must specify the START flag on the first request", + msg="Unexpected error: {}".format(last_err), + ): + raise last_err + + +class SequenceBatcherPreserveOrderingTest(su.SequenceBatcherTestUtil): + def setUp(self): + super().setUp() + # By default, find tritonserver on "localhost", but can be overridden + # with TRITONSERVER_IPADDR envvar + self.server_address_ = ( + os.environ.get("TRITONSERVER_IPADDR", "localhost") + ":8001" + ) + + # Prepare input and expected output based on the model and + # the infer sequence sent for testing. If the test is to be extended + # for different sequence and model, then proper grouping should be added + self.model_name_ = "sequence_py" + self.tensor_data_ = np.ones(shape=[1, 1], dtype=np.int32) + self.inputs_ = [grpcclient.InferInput("INPUT0", [1, 1], "INT32")] + self.inputs_[0].set_data_from_numpy(self.tensor_data_) + self.triton_client = grpcclient.InferenceServerClient(self.server_address_) + + # Atomic request ID for multi-threaded inference + self.request_id_lock = threading.Lock() + self.request_id = 1 + + def send_sequence(self, seq_id, seq_id_map, req_id_map): + if seq_id not in seq_id_map: + seq_id_map[seq_id] = [] + + start, middle, end = (True, False), (False, False), (False, True) + # Send sequence with 1 start, 1 middle, and 1 end request + seq_flags = [start, middle, end] + for start_flag, end_flag in seq_flags: + # Introduce random sleep to better interweave requests from different sequences + time.sleep(random.uniform(0.0, 1.0)) + + # Serialize sending requests to ensure ordered request IDs + with self.request_id_lock: + req_id = self.request_id + self.request_id += 1 + + # Store metadata to validate results later + req_id_map[req_id] = seq_id + seq_id_map[seq_id].append(req_id) + + self.triton_client.async_stream_infer( + self.model_name_, + self.inputs_, + sequence_id=seq_id, + sequence_start=start_flag, + sequence_end=end_flag, + timeout=None, + request_id=str(req_id), + ) + + def _test_sequence_ordering(self, preserve_ordering, decoupled): + # 1. Send a few grpc streaming sequence requests to the model. + # 2. With grpc streaming, the model should receive the requests in + # the same order they are sent from client, and the client should + # receive the responses in the same order sent back by the + # model/server. With sequence scheduler, the requests for each sequence should be routed to the same model + # instance, and no two requests from the same sequence should + # get batched together. + # 3. 
With preserve_ordering=False, we may get the responses back in a different + # order than the requests, but with grpc streaming we should still expect responses for each sequence to be ordered. + # 4. Assert that the sequence values are ordered, and that the response IDs per sequence are ordered + class SequenceResult: + def __init__(self, seq_id, result, request_id): + self.seq_id = seq_id + self.result = result + self.request_id = int(request_id) + + def full_callback(sequence_dict, sequence_list, result, error): + # We expect no model errors for this test + if error: + self.assertTrue(False, error) + + # Gather all the necessary metadata for validation + request_id = int(result.get_response().id) + sequence_id = request_id_map[request_id] + # Overall list of results in the order received, regardless of sequence ID + sequence_list.append(SequenceResult(sequence_id, result, request_id)) + # Ordered results organized by their seq IDs + sequence_dict[sequence_id].append(result) + + # Store ordered list in which responses are received by client + sequence_list = [] + # Store mapping of sequence ID to response results + sequence_dict = {} + # Store mapping of sequence ID to request IDs and vice versa + sequence_id_map = {} + request_id_map = {} + + # Start stream + seq_callback = partial(full_callback, sequence_dict, sequence_list) + self.triton_client.start_stream(callback=seq_callback) + + # Send N sequences concurrently + threads = [] + num_sequences = 10 + for i in range(num_sequences): + # Sequence IDs are 1-indexed + sequence_id = i + 1 + # Add a result list and callback for each sequence + sequence_dict[sequence_id] = [] + threads.append( + threading.Thread( + target=self.send_sequence, + args=(sequence_id, sequence_id_map, request_id_map), + ) + ) + + # Start all sequence threads + for t in threads: + t.start() + + # Wait for threads to return + for t in threads: + t.join() + + # Block until all requests are completed + self.triton_client.stop_stream() + + # Make sure some inferences occurred and metadata was collected + self.assertGreater(len(sequence_dict), 0) + self.assertGreater(len(sequence_list), 0) + + # Validate model results are sorted per sequence ID (model specific logic) + print(f"=== {preserve_ordering=} {decoupled=} ===") + print("Outputs per Sequence:") + for seq_id, sequence in sequence_dict.items(): + seq_outputs = [ + result.as_numpy("OUTPUT0").flatten().tolist() for result in sequence + ] + print(f"{seq_id}: {seq_outputs}") + self.assertEqual(seq_outputs, sorted(seq_outputs)) + + # Validate request/response IDs for each response in a sequence is sorted + # This should be true regardless of preserve_ordering or not + print("Request IDs per Sequence:") + for seq_id in sequence_id_map: + per_seq_request_ids = sequence_id_map[seq_id] + print(f"{seq_id}: {per_seq_request_ids}") + self.assertEqual(per_seq_request_ids, sorted(per_seq_request_ids)) + + # Validate results are sorted in request order if preserve_ordering is True + if preserve_ordering: + request_ids = [s.request_id for s in sequence_list] + print(f"Request IDs overall:\n{request_ids}") + sequence_ids = [s.seq_id for s in sequence_list] + print(f"Sequence IDs overall:\n{sequence_ids}") + self.assertEqual(request_ids, sorted(request_ids)) + + # Assert some dynamic batching of requests was done + stats = self.triton_client.get_inference_statistics( + model_name=self.model_name_, headers={}, as_json=True + ) + model_stats = stats["model_stats"][0] + self.assertEqual(model_stats["name"], self.model_name_) + 
self.assertLess( + int(model_stats["execution_count"]), int(model_stats["inference_count"]) + ) + + def test_sequence_with_preserve_ordering(self): + self.model_name_ = "seqpy_preserve_ordering_nondecoupled" + self._test_sequence_ordering(preserve_ordering=True, decoupled=False) + + def test_sequence_without_preserve_ordering(self): + self.model_name_ = "seqpy_no_preserve_ordering_nondecoupled" + self._test_sequence_ordering(preserve_ordering=False, decoupled=False) + + # FIXME [DLIS-5280]: This may fail for decoupled models if writes to GRPC + # stream are done out of order in server, so disable test for now. + # def test_sequence_with_preserve_ordering_decoupled(self): + # self.model_name_ = "seqpy_preserve_ordering_decoupled" + # self._test_sequence_ordering(preserve_ordering=True, decoupled=True) + + # FIXME [DLIS-5280] + # def test_sequence_without_preserve_ordering_decoupled(self): + # self.model_name_ = "seqpy_no_preserve_ordering_decoupled" + # self._test_sequence_ordering(preserve_ordering=False, decoupled=True) + + +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_sequence_batcher/test.sh b/qa/L0_sequence_batcher/test.sh index a201dcf7a3..d91b433966 100755 --- a/qa/L0_sequence_batcher/test.sh +++ b/qa/L0_sequence_batcher/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,11 +42,21 @@ TEST_RESULT_FILE='test_results.txt' # Must run on a single device or else the TRITONSERVER_DELAY_SCHEDULER # can fail when the requests are distributed to multiple devices. +ldconfig || true + export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG="./client.log" BATCHER_TEST=sequence_batcher_test.py +if [ -z "$TEST_SYSTEM_SHARED_MEMORY" ]; then + TEST_SYSTEM_SHARED_MEMORY="0" +fi + +if [ -z "$TEST_CUDA_SHARED_MEMORY" ]; then + TEST_CUDA_SHARED_MEMORY="0" +fi + if [ -z "$TEST_VALGRIND" ]; then TEST_VALGRIND="0" fi @@ -77,33 +87,43 @@ if [ "$TEST_JETSON" -eq 1 ]; then MODEL_TRIALS="0 v" fi -TF_VERSION=${TF_VERSION:=1} +TF_VERSION=${TF_VERSION:=2} # On windows the paths invoked by the script (running in WSL) must use # /mnt/c when needed but the paths on the tritonserver command-line # must be C:/ style. +WINDOWS=0 if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then MODELDIR=${MODELDIR:=C:/models} DATADIR=${DATADIR:="/mnt/c/data/inferenceserver/${REPO_VERSION}"} BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends} SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe} export WSLENV=$WSLENV:TRITONSERVER_DELAY_SCHEDULER:TRITONSERVER_BACKLOG_DELAY_SCHEDULER + WINDOWS=1 else MODELDIR=${MODELDIR:=`pwd`} DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} SERVER=${TRITON_DIR}/bin/tritonserver BACKEND_DIR=${TRITON_DIR}/backends + + # PyTorch on SBSA requires libgomp to be loaded first. 
See the following + # GitHub issue for more information: + # https://github.com/pytorch/pytorch/issues/2575 + arch=`uname -m` + if [ $arch = "aarch64" ]; then + SERVER_LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libgomp.so.1 + fi fi -SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION}" +SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION} --log-verbose=1" source ../common/util.sh RET=0 # If BACKENDS not specified, set to all -BACKENDS=${BACKENDS:="graphdef savedmodel onnx plan libtorch custom"} +BACKENDS=${BACKENDS:="graphdef savedmodel onnx plan libtorch custom python"} export BACKENDS # If MODEL_TRIALS not specified set to 0 1 2 4 v @@ -151,13 +171,17 @@ export INITIAL_STATE_FILE INITIAL_STATE_ZERO=${INITIAL_STATE_ZERO:="0"} export INITIAL_STATE_ZERO +# If USE_SINGLE_BUFFER is not specified, set to 0 +USE_SINGLE_BUFFER=${USE_SINGLE_BUFFER:="0"} +export USE_SINGLE_BUFFER + # Setup non-variable-size model repositories. The same models are in each # repository but they are configured as: # models0 - four instances with non-batching model # models1 - one instance with batch-size 4 # models2 - two instances with batch-size 2 # models4 - four instances with batch-size 1 -rm -fr *.log *.serverlog models{0,1,2,4} queue_delay_models && mkdir models{0,1,2,4} queue_delay_models +rm -fr *.log models{0,1,2,4} queue_delay_models && mkdir models{0,1,2,4} queue_delay_models # Get the datatype to use based on the backend function get_datatype () { @@ -175,10 +199,29 @@ function get_datatype () { if [[ $1 == "onnx" ]]; then dtype="object int32 bool" fi + if [[ $1 == "libtorch" ]]; then + dtype="object int32 bool" + fi fi echo $dtype } +# Modify corresponding onnx config.pbtxt to create python config.pbtxt +function generate_python_models () { + model_path=$1 + dest_dir=$2 + onnx_model=$(echo ${model_path//python/onnx}) + python_model=$(basename $model_path) + mkdir -p $dest_dir/$python_model/1/ + # for emsemble models keep "platform: ensemble" + if [[ "$model_path" == *"ensemble_model"* ]]; then + cat $onnx_model/config.pbtxt | sed 's/onnx/python/g' > $dest_dir/$python_model/config.pbtxt + else + cat $onnx_model/config.pbtxt | sed 's/platform:.*/backend:\ "python"/g' | sed 's/onnx/python/g' > $dest_dir/$python_model/config.pbtxt + cp ../python_models/sequence_int32/model.py $dest_dir/$python_model/1/ + fi +} + if [[ "$INITIAL_STATE_ZERO" == "1" && "$INITIAL_STATE_FILE" == "1" ]]; then echo -e "\n***\n*** 'INITIAL_STATE_ZERO' and 'INITIAL_STATE_FILE' can't be enabled simultaneously. \n***" exit 1 @@ -200,6 +243,7 @@ else fi MODELS="" +PYTHON_MODELS="" for BACKEND in $BACKENDS; do if [[ $BACKEND == "custom" ]]; then MODELS="$MODELS ../custom_models/custom_sequence_int32" @@ -214,7 +258,13 @@ for BACKEND in $BACKENDS; do for DTYPE in $DTYPES; do # We don't generate ensemble models for bool data type. if [[ $DTYPE != "bool" ]]; then - MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_${BACKEND}_sequence_${DTYPE}" + if [ "$BACKEND" == "python" ]; then + PYTHON_MODELS="$DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_onnx_sequence_${DTYPE}" + TMP=$(echo $PYTHON_MODELS) + MODELS="$MODELS ${TMP//onnx/python}" + else + MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_${BACKEND}_sequence_${DTYPE}" + fi fi done fi @@ -229,28 +279,57 @@ fi for MODEL in $MODELS; do if [[ ! "$TEST_VALGRIND" -eq 1 ]]; then - cp -r $MODEL models1/. 
&& \ + # Skip libtorch string models + if [[ "$MODEL" =~ .*"libtorch".*"object".* ]]; then + continue + fi + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "models1" + else + cp -r $MODEL models1/. + fi (cd models1/$(basename $MODEL) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 1/" config.pbtxt && \ sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt) - cp -r $MODEL models2/. && \ + + # Skip libtorch string models + if [[ "$MODEL" =~ .*"libtorch".*"object".* ]]; then + continue + fi + + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "models2" + else + cp -r $MODEL models2/. + fi (cd models2/$(basename $MODEL) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 2/" config.pbtxt && \ sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 2/" config.pbtxt && \ sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 2/" config.pbtxt) - cp -r $MODEL models4/. && \ + + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "models4" + else + cp -r $MODEL models4/. + fi (cd models4/$(basename $MODEL) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 1/" config.pbtxt && \ sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 4/" config.pbtxt && \ sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 4/" config.pbtxt) + # Duplicate the models for different delay settings - cp -r $MODEL queue_delay_models/. && \ + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "queue_delay_models" + else + cp -r $MODEL queue_delay_models/. + fi (cd queue_delay_models/$(basename $MODEL) && \ sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 1/" config.pbtxt && \ sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt && \ sed -i "s/sequence_batching {/sequence_batching {\\ndirect {\\nmax_queue_delay_microseconds: 3000000\\nminimum_slot_utilization: 0\\n}/" config.pbtxt) + cp -r queue_delay_models/$(basename $MODEL) queue_delay_models/$(basename $MODEL)_half && \ (cd queue_delay_models/$(basename $MODEL)_half && \ sed -i "s/$(basename $MODEL)/$(basename $MODEL)_half/" config.pbtxt && \ @@ -259,6 +338,23 @@ for MODEL in $MODELS; do (cd queue_delay_models/$(basename $MODEL)_full && \ sed -i "s/$(basename $MODEL)/$(basename $MODEL)_full/" config.pbtxt && \ sed -i "s/minimum_slot_utilization: 0/minimum_slot_utilization: 1/" config.pbtxt) + + # TODO: Enable single state buffer testing for sequence batcher + # if [ "$USE_SINGLE_BUFFER" == "1" && "$IMPLICIT_STATE" == "1" ]; then + # SED_REPLACE_PATTERN="N;N;N;N;N;/state.*dims:.*/a use_single_buffer: true" + # (cd models0/$(basename $MODEL) && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd models1/$(basename $MODEL) && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd models2/$(basename $MODEL) && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd models4/$(basename $MODEL) && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd queue_delay_models/$(basename $MODEL)_full && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # (cd queue_delay_models/$(basename $MODEL)_half && \ + # sed -i "$SED_REPLACE_PATTERN" config.pbtxt) + # fi else cp -r $MODEL queue_delay_models/$(basename $MODEL)_full && \ (cd queue_delay_models/$(basename $MODEL)_full && \ @@ -307,6 +403,7 @@ if [ "$INITIAL_STATE_FILE" == "1" ]; then fi MODELS="" +PYTHON_MODELS="" for BACKEND in $BACKENDS; do if [[ $BACKEND == "custom" ]]; then 
MODELS="$MODELS ../custom_models/custom_sequence_int32" @@ -320,7 +417,13 @@ for BACKEND in $BACKENDS; do for DTYPE in $DTYPES; do # We don't generate ensemble models for bool data type. if [[ $DTYPE != "bool" ]]; then - MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_${BACKEND}_nobatch_sequence_${DTYPE}" + if [ "$BACKEND" == "python" ]; then + PYTHON_MODELS="$DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_onnx_nobatch_sequence_${DTYPE}" + TMP=$(echo $PYTHON_MODELS) + MODELS="$MODELS ${TMP//onnx/python}" + else + MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_${BACKEND}_nobatch_sequence_${DTYPE}" + fi fi done @@ -329,22 +432,27 @@ for BACKEND in $BACKENDS; do done for MODEL in $MODELS; do - cp -r $MODEL models0/. && \ - (cd models0/$(basename $MODEL) && \ - sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 4/" config.pbtxt && \ - sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 4/" config.pbtxt) - - if [ "$INITIAL_STATE_FILE" == "1" ]; then - mkdir -p models0/$(basename $MODEL)/initial_state/ && cp input_state_data models0/$(basename $MODEL)/initial_state/ && \ - (cd models0/$(basename $MODEL) && \ - sed -i "s/zero_data.*/data_file:\"input_state_data\"/" config.pbtxt) - fi + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "models0" + else + cp -r $MODEL models0/. + fi + (cd models0/$(basename $MODEL) && \ + sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 4/" config.pbtxt && \ + sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 4/" config.pbtxt) + + if [ "$INITIAL_STATE_FILE" == "1" ]; then + mkdir -p models0/$(basename $MODEL)/initial_state/ && cp input_state_data models0/$(basename $MODEL)/initial_state/ && \ + (cd models0/$(basename $MODEL) && \ + sed -i "s/zero_data.*/data_file:\"input_state_data\"/" config.pbtxt) + fi done # modelsv - one instance with batch-size 4 rm -fr modelsv && mkdir modelsv MODELS="" +PYTHON_MODELS="" for BACKEND in $BACKENDS; do if [[ $BACKEND == "custom" ]]; then MODELS="$MODELS ../custom_models/custom_sequence_int32" @@ -358,7 +466,13 @@ for BACKEND in $BACKENDS; do for DTYPE in $DTYPES; do # We don't generate ensemble models for bool data type. if [[ $DTYPE != "bool" ]]; then - MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/${VAR_MODEL_REPOSITORY}/*_${BACKEND}_sequence_${DTYPE}" + if [ "$BACKEND" == "python" ]; then + PYTHON_MODELS="$DATADIR/qa_ensemble_model_repository/$FIXED_MODEL_REPOSITORY/*_onnx_sequence_${DTYPE}" + TMP=$(echo $PYTHON_MODELS) + MODELS="$MODELS ${TMP//onnx/python}" + else + MODELS="$MODELS $DATADIR/qa_ensemble_model_repository/${VAR_MODEL_REPOSITORY}/*_${BACKEND}_sequence_${DTYPE}" + fi fi done fi @@ -366,17 +480,25 @@ for BACKEND in $BACKENDS; do done for MODEL in $MODELS; do - cp -r $MODEL modelsv/. 
&& \ - (cd modelsv/$(basename $MODEL) && \ - sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ - sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 1/" config.pbtxt && \ - sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt) - - if [ "$INITIAL_STATE_FILE" == "1" ]; then - mkdir -p modelsv/$(basename $MODEL)/initial_state/ && cp input_state_data modelsv/$(basename $MODEL)/initial_state/ && \ - (cd modelsv/$(basename $MODEL) && \ - sed -i "s/zero_data.*/data_file:\"input_state_data\"/" config.pbtxt) - fi + # Skip libtorch string models + if [[ "$MODEL" =~ .*"libtorch".*"object".* ]]; then + continue + fi + if [[ "$MODEL" =~ .*"python".* ]]; then + generate_python_models "$MODEL" "modelsv" + else + cp -r $MODEL modelsv/. + fi + (cd modelsv/$(basename $MODEL) && \ + sed -i "s/^max_batch_size:.*/max_batch_size: 4/" config.pbtxt && \ + sed -i "s/kind: KIND_GPU/kind: KIND_GPU\\ncount: 1/" config.pbtxt && \ + sed -i "s/kind: KIND_CPU/kind: KIND_CPU\\ncount: 1/" config.pbtxt) + + if [ "$INITIAL_STATE_FILE" == "1" ]; then + mkdir -p modelsv/$(basename $MODEL)/initial_state/ && cp input_state_data modelsv/$(basename $MODEL)/initial_state/ && \ + (cd modelsv/$(basename $MODEL) && \ + sed -i "s/zero_data.*/data_file:\"input_state_data\"/" config.pbtxt) + fi done # Same test work on all models since they all have same total number @@ -408,7 +530,7 @@ for model_trial in $MODEL_TRIALS; do for i in $NO_DELAY_TESTS; do SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$MODEL_PATH.serverlog" + SERVER_LOG="./$i.$MODEL_PATH.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" @@ -468,7 +590,7 @@ for model_trial in $MODEL_TRIALS; do [[ "$i" != "test_half_batch" ]] && export TRITONSERVER_DELAY_SCHEDULER=4 && [[ "$i" != "test_backlog_sequence_timeout" ]] && export TRITONSERVER_DELAY_SCHEDULER=12 SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$MODEL_PATH.serverlog" + SERVER_LOG="./$i.$MODEL_PATH.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" @@ -538,7 +660,7 @@ if [[ $BACKENDS == *"custom"* ]]; then export TRITONSERVER_DELAY_SCHEDULER=12 SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$MODEL_PATH.serverlog" + SERVER_LOG="./$i.$MODEL_PATH.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" @@ -596,7 +718,7 @@ for i in $QUEUE_DELAY_TESTS ; do export TRITONSERVER_BACKLOG_DELAY_SCHEDULER=0 export TRITONSERVER_DELAY_SCHEDULER=2 SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$MODEL_PATH.serverlog" + SERVER_LOG="./$i.$MODEL_PATH.server.log" if [ "$TEST_VALGRIND" -eq 1 ]; then LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" @@ -644,6 +766,144 @@ for i in $QUEUE_DELAY_TESTS ; do set -e done +# Test request timeout with sequence batcher +# only run the test outside shared memory setting as +# shared memory feature is irrelevant +if [ "$TEST_SYSTEM_SHARED_MEMORY" -ne 1 ] && [ "$TEST_CUDA_SHARED_MEMORY" -ne 1 ]; then + export NO_BATCHING=0 + export MODEL_INSTANCES=1 + export BATCHER_TYPE="FIXED" + + TEST_CASE=SequenceBatcherRequestTimeoutTest + MODEL_PATH=request_timeout_models + mkdir -p ${MODEL_PATH}/custom_sequence_int32_timeout/1 + + SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" + SERVER_LOG="./$TEST_CASE.$MODEL_PATH.server.log" + + if [ "$TEST_VALGRIND" 
-eq 1 ]; then + LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" + LEAKCHECK_ARGS="$LEAKCHECK_ARGS_BASE --log-file=$LEAKCHECK_LOG" + run_server_leakcheck + else + run_server + fi + + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + echo "Test: $TEST_CASE, repository $MODEL_PATH" >>$CLIENT_LOG + + set +e + python3 $BATCHER_TEST $TEST_CASE >>$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test $TEST_CASE Failed\n***" >>$CLIENT_LOG + echo -e "\n***\n*** Test $TEST_CASE Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE 2 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill_server + + set +e + if [ "$TEST_VALGRIND" -eq 1 ]; then + python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG + if [ $? -ne 0 ]; then + RET=1 + fi + fi + set -e +fi + +### Start Preserve Ordering Tests ### + +# Test only supported on windows currently due to use of python backend models +if [ ${WINDOWS} -ne 1 ]; then + # Test preserve ordering true/false and decoupled/non-decoupled + TEST_CASE=SequenceBatcherPreserveOrderingTest + MODEL_PATH=preserve_ordering_models + BASE_MODEL="../python_models/sequence_py" + rm -rf ${MODEL_PATH} + + # FIXME [DLIS-5280]: This may fail for decoupled models if writes to GRPC + # stream are done out of order in server, so decoupled tests are disabled. + MODES="decoupled nondecoupled" + for mode in $MODES; do + NO_PRESERVE="${MODEL_PATH}/seqpy_no_preserve_ordering_${mode}" + mkdir -p ${NO_PRESERVE}/1 + cp ${BASE_MODEL}/config.pbtxt ${NO_PRESERVE} + cp ${BASE_MODEL}/model.py ${NO_PRESERVE}/1 + + PRESERVE="${MODEL_PATH}/seqpy_preserve_ordering_${mode}" + cp -r ${NO_PRESERVE} ${PRESERVE} + sed -i "s/^preserve_ordering: False/preserve_ordering: True/" ${PRESERVE}/config.pbtxt + + if [ ${mode} == "decoupled" ]; then + echo -e "\nmodel_transaction_policy { decoupled: true }" >> ${NO_PRESERVE}/config.pbtxt + echo -e "\nmodel_transaction_policy { decoupled: true }" >> ${PRESERVE}/config.pbtxt + fi + done + + SERVER_ARGS="--model-repository=$MODELDIR/$MODEL_PATH ${SERVER_ARGS_EXTRA}" + SERVER_LOG="./$TEST_CASE.$MODEL_PATH.server.log" + + if [ "$TEST_VALGRIND" -eq 1 ]; then + LEAKCHECK_LOG="./$i.$MODEL_PATH.valgrind.log" + LEAKCHECK_ARGS="$LEAKCHECK_ARGS_BASE --log-file=$LEAKCHECK_LOG" + run_server_leakcheck + else + run_server + fi + + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + echo "Test: $TEST_CASE, repository $MODEL_PATH" >>$CLIENT_LOG + + set +e + python3 $BATCHER_TEST $TEST_CASE >>$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test $TEST_CASE Failed\n***" >>$CLIENT_LOG + echo -e "\n***\n*** Test $TEST_CASE Failed\n***" + RET=1 + else + # 2 for preserve_ordering = True/False + check_test_results $TEST_RESULT_FILE 2 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill_server + + set +e + if [ "$TEST_VALGRIND" -eq 1 ]; then + python3 ../common/check_valgrind_log.py -f $LEAKCHECK_LOG + if [ $? 
-ne 0 ]; then + RET=1 + fi + fi + set -e +fi + +### End Preserve Ordering Tests ### + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else diff --git a/qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py b/qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py old mode 100644 new mode 100755 index d992b75246..15f16da352 --- a/qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py +++ b/qa/L0_sequence_corrid_batcher/sequence_corrid_batcher_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,27 +27,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") import os -import time import threading +import time import unittest + import numpy as np -import test_util as tu import sequence_util as su +import test_util as tu -_test_system_shared_memory = bool( - int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0))) -_test_cuda_shared_memory = bool( - int(os.environ.get('TEST_CUDA_SHARED_MEMORY', 0))) +_test_system_shared_memory = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0))) +_test_cuda_shared_memory = bool(int(os.environ.get("TEST_CUDA_SHARED_MEMORY", 0))) -_no_batching = (int(os.environ['NO_BATCHING']) == 1) -_model_instances = int(os.environ['MODEL_INSTANCES']) +_no_batching = int(os.environ["NO_BATCHING"]) == 1 +_model_instances = int(os.environ["MODEL_INSTANCES"]) if _no_batching: - _trials = ("savedmodel_nobatch", "graphdef_nobatch", "plan_nobatch", - "onnx_nobatch") + _trials = ("savedmodel_nobatch", "graphdef_nobatch", "plan_nobatch", "onnx_nobatch") else: _trials = ("savedmodel", "graphdef", "plan", "onnx") @@ -54,23 +55,20 @@ class SequenceCorrIDBatcherTest(su.SequenceBatcherTestUtil): - def get_datatype(self, trial): return np.int32 - def get_expected_result(self, - expected_result, - corrid, - value, - trial, - flag_str=None): + def get_expected_result(self, expected_result, corrid, value, trial, flag_str=None): # Adjust the expected_result for models that - # couldn't implement the full accumulator. See + # could not implement the full accumulator. See # qa/common/gen_qa_dyna_sequence_models.py for more # information. 
- if ((("nobatch" not in trial) and ("custom" not in trial)) or \ - ("graphdef" in trial) or ("plan" in trial) or \ - ("onnx" in trial)) or ("libtorch" in trial): + if ( + (("nobatch" not in trial) and ("custom" not in trial)) + or ("graphdef" in trial) + or ("plan" in trial) + or ("onnx" in trial) + ) or ("libtorch" in trial): expected_result = value if flag_str is not None: if "start" in flag_str: @@ -87,14 +85,16 @@ def test_skip_batch(self): for trial in _trials: self.clear_deferred_exceptions() dtype = self.get_datatype(trial) - precreated_shm0_handles = self.precreate_register_regions((1, 3), - dtype, 0) + precreated_shm0_handles = self.precreate_register_regions((1, 3), dtype, 0) precreated_shm1_handles = self.precreate_register_regions( - (11, 12, 13, 14), dtype, 1) + (11, 12, 13, 14), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_regions( - (111, 113), dtype, 2) + (111, 113), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_regions( - (1111, 1112, 1113, 1114), dtype, 3) + (1111, 1112, 1113, 1114), dtype, 3 + ) try: model_name = tu.get_dyna_sequence_model_name(trial, dtype) @@ -103,12 +103,11 @@ def test_skip_batch(self): # Need scheduler to wait for queue to contain all # inferences for both sequences. self.assertIn("TRITONSERVER_DELAY_SCHEDULER", os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) + self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", os.environ) self.assertEqual( - int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) - self.assertIn("TRITONSERVER_BACKLOG_DELAY_SCHEDULER", - os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) + int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0 + ) corrids = [1001, 1002, 1003, 1004] threads = [] @@ -123,12 +122,14 @@ def test_skip_batch(self): (None, None), # (flag_str, value, pre_delay_ms) (("start", 1, None), ("end", 3, None)), - self.get_expected_result(4 + corrids[0], corrids[0], - 3, trial, "end"), - precreated_shm0_handles), - kwargs={ - 'sequence_name': "{}".format(self._testMethodName) - })) + self.get_expected_result( + 4 + corrids[0], corrids[0], 3, trial, "end" + ), + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -139,15 +140,20 @@ def test_skip_batch(self): corrids[1], (None, None), # (flag_str, value, pre_delay_ms) - (("start", 11, None), (None, 12, None), - (None, 13, None), ("end", 14, None)), - self.get_expected_result(50 + corrids[1], - corrids[1], 14, trial, - "end"), - precreated_shm1_handles), - kwargs={ - 'sequence_name': "{}".format(self._testMethodName) - })) + ( + ("start", 11, None), + (None, 12, None), + (None, 13, None), + ("end", 14, None), + ), + self.get_expected_result( + 50 + corrids[1], corrids[1], 14, trial, "end" + ), + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -159,13 +165,14 @@ def test_skip_batch(self): (None, None), # (flag_str, value, pre_delay_ms) (("start", 111, None), ("end", 113, None)), - self.get_expected_result(224 + corrids[2], - corrids[2], 113, trial, - "end"), - precreated_shm2_handles), - kwargs={ - 'sequence_name': "{}".format(self._testMethodName) - })) + self.get_expected_result( + 224 + corrids[2], corrids[2], 113, trial, "end" + ), + precreated_shm2_handles, + ), + kwargs={"sequence_name": 
"{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_async, @@ -176,15 +183,20 @@ def test_skip_batch(self): corrids[3], (None, None), # (flag_str, value, pre_delay_ms) - (("start", 1111, None), (None, 1112, None), - (None, 1113, None), ("end", 1114, None)), - self.get_expected_result(4450 + corrids[3], - corrids[3], 1114, trial, - "end"), - precreated_shm3_handles), - kwargs={ - 'sequence_name': "{}".format(self._testMethodName) - })) + ( + ("start", 1111, None), + (None, 1112, None), + (None, 1113, None), + ("end", 1114, None), + ), + self.get_expected_result( + 4450 + corrids[3], corrids[3], 1114, trial, "end" + ), + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads[1].start() threads[3].start() @@ -210,5 +222,5 @@ def test_skip_batch(self): self.cleanup_shm_regions(precreated_shm3_handles) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_sequence_corrid_batcher/test.sh b/qa/L0_sequence_corrid_batcher/test.sh index 83a8085342..8d114a395a 100755 --- a/qa/L0_sequence_corrid_batcher/test.sh +++ b/qa/L0_sequence_corrid_batcher/test.sh @@ -57,7 +57,7 @@ export CUDA_VISIBLE_DEVICES=0 # Setup non-variable-size model repositories. The same models are in each # repository but they are configured as: # models4 - four instances with batch-size 1 -rm -fr *.log *.serverlog models{0,1,2,4} && mkdir models4 +rm -fr *.log models{0,1,2,4} && mkdir models4 for m in \ $DATADIR/qa_dyna_sequence_model_repository/graphdef_dyna_sequence_int32 \ $DATADIR/qa_dyna_sequence_model_repository/savedmodel_dyna_sequence_int32 \ @@ -88,7 +88,7 @@ for model_trial in 4; do export TRITONSERVER_BACKLOG_DELAY_SCHEDULER=0 export TRITONSERVER_DELAY_SCHEDULER=12 SERVER_ARGS="--model-repository=`pwd`/$MODEL_DIR" - SERVER_LOG="./$i.$MODEL_DIR.serverlog" + SERVER_LOG="./$i.$MODEL_DIR.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_sequence_stress/sequence_stress.py b/qa/L0_sequence_stress/sequence_stress.py old mode 100644 new mode 100755 index 44679e171e..039cf793a2 --- a/qa/L0_sequence_stress/sequence_stress.py +++ b/qa/L0_sequence_stress/sequence_stress.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,17 +27,18 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") import argparse -from builtins import range -from builtins import str -import time import threading +import time import traceback +from builtins import range, str +from functools import partial + import numpy as np import test_util as tu -from functools import partial import tritongrpcclient as grpcclient from tritonclientutils import np_to_triton_dtype @@ -55,7 +58,6 @@ class UserData: - def __init__(self): self._completed_requests = queue.Queue() @@ -70,21 +72,27 @@ class TimeoutException(Exception): pass -def check_sequence_async(client_metadata, - trial, - model_name, - input_dtype, - steps, - timeout_ms=DEFAULT_TIMEOUT_MS, - sequence_name=""): +def check_sequence_async( + client_metadata, + trial, + model_name, + input_dtype, + steps, + timeout_ms=DEFAULT_TIMEOUT_MS, + sequence_name="", +): """Perform sequence of inferences using async run. The 'steps' holds a list of tuples, one for each inference with format: (flag_str, value, expected_result, delay_ms) """ - if (("savedmodel" in trial) or ("graphdef" in trial) or - ("custom" in trial) or ("plan" in trial)): + if ( + ("savedmodel" in trial) + or ("graphdef" in trial) + or ("custom" in trial) + or ("plan" in trial) + ): tensor_shape = ( 1, 1, @@ -107,27 +115,29 @@ def check_sequence_async(client_metadata, seq_start = False seq_end = False if flag_str is not None: - seq_start = ("start" in flag_str) - seq_end = ("end" in flag_str) + seq_start = "start" in flag_str + seq_end = "end" in flag_str if input_dtype == np.object_: in0 = np.full(tensor_shape, value, dtype=np.int32) - in0n = np.array([str(x) for x in in0.reshape(in0.size)], - dtype=object) + in0n = np.array([str(x) for x in in0.reshape(in0.size)], dtype=object) in0 = in0n.reshape(tensor_shape) else: in0 = np.full(tensor_shape, value, dtype=input_dtype) inputs = [ - grpcclient.InferInput("INPUT", tensor_shape, - np_to_triton_dtype(input_dtype)), + grpcclient.InferInput( + "INPUT", tensor_shape, np_to_triton_dtype(input_dtype) + ), ] inputs[0].set_data_from_numpy(in0) - triton_client.async_stream_infer(model_name, - inputs, - sequence_id=sequence_id, - sequence_start=seq_start, - sequence_end=seq_end) + triton_client.async_stream_infer( + model_name, + inputs, + sequence_id=sequence_id, + sequence_start=seq_start, + sequence_end=seq_end, + ) sent_count += 1 if delay_ms is not None: @@ -146,23 +156,21 @@ def check_sequence_async(client_metadata, if timeout_ms != None: now_ms = int(round(time.time() * 1000)) if (now_ms - seq_start_ms) > timeout_ms: - raise TimeoutException( - "Timeout expired for {}".format(sequence_name)) + raise TimeoutException("Timeout expired for {}".format(sequence_name)) result = results.as_numpy("OUTPUT")[0][0] if FLAGS.verbose: - print("{} {}: + {} = {}".format(sequence_name, sequence_id, value, - result)) + print("{} {}: + {} = {}".format(sequence_name, sequence_id, value, result)) if expected is not None: if input_dtype == np.object_: - assert int( - result - ) == expected, "{}: expected result {}, got {}".format( - sequence_name, expected, int(result)) + assert int(result) == expected, "{}: expected result {}, got {}".format( + sequence_name, expected, int(result) + ) else: assert result == expected, "{}: expected result {}, got {}".format( - sequence_name, expected, result) + sequence_name, expected, result + ) triton_client.stop_stream() @@ -175,12 +183,12 @@ def get_datatype(trial): return np.int32 -def sequence_valid(client_metadata, rng, trial, model_name, dtype, len_mean, - len_stddev, sequence_name): +def 
sequence_valid( + client_metadata, rng, trial, model_name, dtype, len_mean, len_stddev, sequence_name +): # Create a variable length sequence with "start" and "end" flags. seqlen = max(1, int(rng.normal(len_mean, len_stddev))) - print("{} {}: valid seqlen = {}".format(sequence_name, client_metadata[1], - seqlen)) + print("{} {}: valid seqlen = {}".format(sequence_name, client_metadata[1], seqlen)) values = rng.randint(0, 1024 * 1024, size=seqlen, dtype=dtype) @@ -199,31 +207,34 @@ def sequence_valid(client_metadata, rng, trial, model_name, dtype, len_mean, expected_result += val # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, expected_result, delay_ms),) + steps.append( + (flags, val, expected_result, delay_ms), + ) - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, trial, model_name, dtype, steps, sequence_name=sequence_name + ) -def sequence_valid_valid(client_metadata, rng, trial, model_name, dtype, - len_mean, len_stddev, sequence_name): +def sequence_valid_valid( + client_metadata, rng, trial, model_name, dtype, len_mean, len_stddev, sequence_name +): # Create two variable length sequences with "start" and "end" # flags, where both sequences use the same correlation ID and are # sent back-to-back. seqlen = [ max(1, int(rng.normal(len_mean, len_stddev))), - max(1, int(rng.normal(len_mean, len_stddev))) + max(1, int(rng.normal(len_mean, len_stddev))), ] - print("{} {}: valid-valid seqlen[0] = {}, seqlen[1] = {}".format( - sequence_name, client_metadata[1], seqlen[0], seqlen[1])) + print( + "{} {}: valid-valid seqlen[0] = {}, seqlen[1] = {}".format( + sequence_name, client_metadata[1], seqlen[0], seqlen[1] + ) + ) values = [ rng.randint(0, 1024 * 1024, size=seqlen[0], dtype=dtype), - rng.randint(0, 1024 * 1024, size=seqlen[1], dtype=dtype) + rng.randint(0, 1024 * 1024, size=seqlen[1], dtype=dtype), ] for p in [0, 1]: @@ -242,31 +253,34 @@ def sequence_valid_valid(client_metadata, rng, trial, model_name, dtype, expected_result += val # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, expected_result, delay_ms),) + steps.append( + (flags, val, expected_result, delay_ms), + ) - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, trial, model_name, dtype, steps, sequence_name=sequence_name + ) -def sequence_valid_no_end(client_metadata, rng, trial, model_name, dtype, - len_mean, len_stddev, sequence_name): +def sequence_valid_no_end( + client_metadata, rng, trial, model_name, dtype, len_mean, len_stddev, sequence_name +): # Create two variable length sequences, the first with "start" and # "end" flags and the second with no "end" flag, where both # sequences use the same correlation ID and are sent back-to-back. 
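# --- Illustrative sketch (not part of the diff above): the "steps" list
# --- consumed by check_sequence_async and built by the sequence_* helpers in
# --- this file. The values below are hypothetical.
# Each entry is (flag_str, value, expected_result, delay_ms); expected_result
# is the running total of the values sent so far in the sequence.
example_steps = [
    ("start", 4, 4, None),  # first request carries the START flag
    (None, 7, 11, None),    # middle request: 4 + 7 = 11
    ("end", 2, 13, None),   # last request carries the END flag: 4 + 7 + 2 = 13
]
# check_sequence_async(client_metadata, trial, model_name, np.int32,
#                      example_steps, sequence_name="illustration")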
seqlen = [ max(1, int(rng.normal(len_mean, len_stddev))), - max(1, int(rng.normal(len_mean, len_stddev))) + max(1, int(rng.normal(len_mean, len_stddev))), ] - print("{} {}: valid-no-end seqlen[0] = {}, seqlen[1] = {}".format( - sequence_name, client_metadata[1], seqlen[0], seqlen[1])) + print( + "{} {}: valid-no-end seqlen[0] = {}, seqlen[1] = {}".format( + sequence_name, client_metadata[1], seqlen[0], seqlen[1] + ) + ) values = [ rng.randint(0, 1024 * 1024, size=seqlen[0], dtype=dtype), - rng.randint(0, 1024 * 1024, size=seqlen[1], dtype=dtype) + rng.randint(0, 1024 * 1024, size=seqlen[1], dtype=dtype), ] for p in [0, 1]: @@ -285,23 +299,22 @@ def sequence_valid_no_end(client_metadata, rng, trial, model_name, dtype, expected_result += val # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, expected_result, delay_ms),) + steps.append( + (flags, val, expected_result, delay_ms), + ) - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, trial, model_name, dtype, steps, sequence_name=sequence_name + ) -def sequence_no_start(client_metadata, rng, trial, model_name, dtype, - sequence_name): +def sequence_no_start(client_metadata, rng, trial, model_name, dtype, sequence_name): # Create a sequence without a "start" flag. Sequence should get an # error from the server. seqlen = 1 - print("{} {}: no-start seqlen = {}".format(sequence_name, - client_metadata[1], seqlen)) + print( + "{} {}: no-start seqlen = {}".format(sequence_name, client_metadata[1], seqlen) + ) values = rng.randint(0, 1024 * 1024, size=seqlen, dtype=dtype) @@ -313,29 +326,33 @@ def sequence_no_start(client_metadata, rng, trial, model_name, dtype, delay_ms = None # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, None, delay_ms),) + steps.append( + (flags, val, None, delay_ms), + ) try: - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, + trial, + model_name, + dtype, + steps, + sequence_name=sequence_name, + ) assert False, "expected inference failure from missing START flag" except Exception as ex: if "must specify the START flag" not in ex.message(): raise -def sequence_no_end(client_metadata, rng, trial, model_name, dtype, len_mean, - len_stddev, sequence_name): +def sequence_no_end( + client_metadata, rng, trial, model_name, dtype, len_mean, len_stddev, sequence_name +): # Create a variable length sequence with "start" flag but that # never ends. The sequence should be aborted by the server and its # slot reused for another sequence. 
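# --- Illustrative sketch (not part of the diff above): a "no-end" steps list
# --- like the one this helper builds; values are hypothetical. No entry ever
# --- carries the "end" flag, so the server's idle timeout eventually aborts
# --- the sequence and frees its slot for reuse.
no_end_steps = [
    ("start", 5, 5, None),
    (None, 9, 14, None),  # 5 + 9 = 14; the sequence is left open
]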
seqlen = max(1, int(rng.normal(len_mean, len_stddev))) - print("{} {}: no-end seqlen = {}".format(sequence_name, client_metadata[1], - seqlen)) + print("{} {}: no-end seqlen = {}".format(sequence_name, client_metadata[1], seqlen)) values = rng.randint(0, 1024 * 1024, size=seqlen, dtype=dtype) @@ -352,18 +369,16 @@ def sequence_no_end(client_metadata, rng, trial, model_name, dtype, len_mean, expected_result += val # (flag_str, value, expected_result, delay_ms) - steps.append((flags, val, expected_result, delay_ms),) + steps.append( + (flags, val, expected_result, delay_ms), + ) - check_sequence_async(client_metadata, - trial, - model_name, - dtype, - steps, - sequence_name=sequence_name) + check_sequence_async( + client_metadata, trial, model_name, dtype, steps, sequence_name=sequence_name + ) -def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, - dtype): +def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, dtype): # Thread responsible for generating sequences of inference # requests. global _thread_exceptions @@ -389,9 +404,13 @@ def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, for c in range(common_cnt + rare_cnt): client_metadata_list.append( - (grpcclient.InferenceServerClient("localhost:8001", - verbose=FLAGS.verbose), - correlation_id_base + c)) + ( + grpcclient.InferenceServerClient( + "localhost:8001", verbose=FLAGS.verbose + ), + correlation_id_base + c, + ) + ) last_choices.append(None) rare_idx = 0 @@ -407,34 +426,40 @@ def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, # exercise the idle sequence path of the sequence # scheduler if choice < 0.33: - sequence_no_end(client_metadata_list[client_idx], - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_no_end( + client_metadata_list[client_idx], + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "no-end" elif choice < 0.66: - sequence_valid_no_end(client_metadata_list[client_idx], - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid_no_end( + client_metadata_list[client_idx], + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid-no-end" else: - sequence_valid_valid(client_metadata_list[client_idx], - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid_valid( + client_metadata_list[client_idx], + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid-valid" rare_idx = (rare_idx + 1) % rare_cnt @@ -450,54 +475,67 @@ def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, # just assume that the no-start is a continuation of # the no-end sequence instead of being a sequence # missing start flag. 
- if ((last_choice != "no-end") and - (last_choice != "valid-no-end") and (choice < 0.01)): - sequence_no_start(client_metadata, - rng, - trial, - model_name, - dtype, - sequence_name=name) + if ( + (last_choice != "no-end") + and (last_choice != "valid-no-end") + and (choice < 0.01) + ): + sequence_no_start( + client_metadata, + rng, + trial, + model_name, + dtype, + sequence_name=name, + ) last_choices[client_idx] = "no-start" elif choice < 0.05: - sequence_no_end(client_metadata, - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_no_end( + client_metadata, + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "no-end" elif choice < 0.10: - sequence_valid_no_end(client_metadata, - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid_no_end( + client_metadata, + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid-no-end" elif choice < 0.15: - sequence_valid_valid(client_metadata, - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid_valid( + client_metadata, + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid-valid" else: - sequence_valid(client_metadata, - rng, - trial, - model_name, - dtype, - SEQUENCE_LENGTH_MEAN, - SEQUENCE_LENGTH_STDEV, - sequence_name=name) + sequence_valid( + client_metadata, + rng, + trial, + model_name, + dtype, + SEQUENCE_LENGTH_MEAN, + SEQUENCE_LENGTH_STDEV, + sequence_name=name, + ) last_choices[client_idx] = "valid" except Exception as ex: @@ -518,38 +556,40 @@ def stress_thread(name, seed, pass_cnt, correlation_id_base, trial, model_name, def check_status(model_name): - client = grpcclient.InferenceServerClient("localhost:8001", - verbose=FLAGS.verbose) + client = grpcclient.InferenceServerClient("localhost:8001", verbose=FLAGS.verbose) stats = client.get_inference_statistics(model_name) print(stats) -if __name__ == '__main__': +if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument('-v', - '--verbose', - action="store_true", - required=False, - default=False, - help='Enable verbose output') - parser.add_argument('-r', - '--random-seed', - type=int, - required=False, - help='Random seed.') - parser.add_argument('-t', - '--concurrency', - type=int, - required=False, - default=8, - help='Request concurrency. Default is 8.') parser.add_argument( - '-i', - '--iterations', + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-r", "--random-seed", type=int, required=False, help="Random seed." + ) + parser.add_argument( + "-t", + "--concurrency", + type=int, + required=False, + default=8, + help="Request concurrency. Default is 8.", + ) + parser.add_argument( + "-i", + "--iterations", type=int, required=False, default=200, - help='Number of iterations of stress test to run. Default is 200.') + help="Number of iterations of stress test to run. Default is 200.", + ) FLAGS = parser.parse_args() # Initialize the random seed. 
For reproducibility each thread @@ -583,10 +623,19 @@ def check_status(model_name): correlation_id_base = 1 + (idx * CORRELATION_ID_BLOCK_SIZE) threads.append( - threading.Thread(target=stress_thread, - args=(thread_name, seed, FLAGS.iterations, - correlation_id_base, trial, model_name, - dtype))) + threading.Thread( + target=stress_thread, + args=( + thread_name, + seed, + FLAGS.iterations, + correlation_id_base, + trial, + model_name, + dtype, + ), + ) + ) for t in threads: t.start() diff --git a/qa/L0_sequence_stress/test.sh b/qa/L0_sequence_stress/test.sh index 3961107dfe..b2bc66f8ac 100755 --- a/qa/L0_sequence_stress/test.sh +++ b/qa/L0_sequence_stress/test.sh @@ -39,7 +39,7 @@ RET=0 # models1 - one instance with batch-size 4 # models2 - two instances with batch-size 2 # models4 - four instances with batch-size 1 -rm -fr *.log *.serverlog models{1,2,4} && mkdir models{1,2,4} +rm -fr *.log models{1,2,4} && mkdir models{1,2,4} for m in ../custom_models/custom_sequence_int32 ; do cp -r $m models1/. && \ (cd models1/$(basename $m) && \ @@ -65,7 +65,7 @@ done for model_trial in 1 2 4 ; do MODEL_DIR=models${model_trial} SERVER_ARGS="--model-repository=`pwd`/$MODEL_DIR" - SERVER_LOG="./$MODEL_DIR.serverlog" + SERVER_LOG="./$MODEL_DIR.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_server_status/server_status_test.py b/qa/L0_server_status/server_status_test.py old mode 100644 new mode 100755 index ee6db2a575..7ab04708f0 --- a/qa/L0_server_status/server_status_test.py +++ b/qa/L0_server_status/server_status_test.py @@ -1,4 +1,6 @@ -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,12 +27,14 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
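The server_status tests that follow all build both an HTTP and a gRPC client and compare the dict-style responses of the former with the attribute-style protobuf responses of the latter. A minimal standalone sketch of that access pattern, assuming a server on the default local ports:

import tritonhttpclient as httpclient
import tritongrpcclient as grpcclient

# HTTP responses are plain Python dicts...
http_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=True)
md = http_client.get_server_metadata()
print(md["name"], md["version"], md["extensions"])

# ...while gRPC responses are protobuf messages accessed as attributes.
grpc_client = grpcclient.InferenceServerClient(url="localhost:8001", verbose=True)
md = grpc_client.get_server_metadata()
print(md.name, md.version, md.extensions)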
import sys + sys.path.append("../common") -import numpy as np import os import unittest + import infer_util as iu +import numpy as np import test_util as tu import tritongrpcclient as grpcclient import tritonhttpclient as httpclient @@ -38,24 +42,29 @@ class ServerMetadataTest(tu.TestResultCollector): - def test_basic(self): try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: model_name = "graphdef_int32_int8_int8" extensions = [ - 'classification', 'sequence', 'model_repository', - 'schedule_policy', 'model_configuration', - 'system_shared_memory', 'cuda_shared_memory', - 'binary_tensor_data', 'statistics' + "classification", + "sequence", + "model_repository", + "schedule_policy", + "model_configuration", + "system_shared_memory", + "cuda_shared_memory", + "binary_tensor_data", + "statistics", ] if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) @@ -63,16 +72,18 @@ def test_basic(self): model_metadata = triton_client.get_model_metadata(model_name) if pair[1] == "http": - self.assertEqual(os.environ["TRITON_SERVER_VERSION"], - server_metadata['version']) - self.assertEqual("triton", server_metadata['name']) + self.assertEqual( + os.environ["TRITON_SERVER_VERSION"], server_metadata["version"] + ) + self.assertEqual("triton", server_metadata["name"]) for ext in extensions: - self.assertIn(ext, server_metadata['extensions']) + self.assertIn(ext, server_metadata["extensions"]) - self.assertEqual(model_name, model_metadata['name']) + self.assertEqual(model_name, model_metadata["name"]) else: - self.assertEqual(os.environ["TRITON_SERVER_VERSION"], - server_metadata.version) + self.assertEqual( + os.environ["TRITON_SERVER_VERSION"], server_metadata.version + ) self.assertEqual("triton", server_metadata.name) for ext in extensions: self.assertIn(ext, server_metadata.extensions) @@ -83,91 +94,96 @@ def test_basic(self): def test_unknown_model(self): try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: model_name = "foo" if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) server_metadata = triton_client.get_server_metadata() if pair[1] == "http": - self.assertEqual(os.environ["TRITON_SERVER_VERSION"], - server_metadata['version']) - self.assertEqual("triton", server_metadata['name']) + self.assertEqual( + os.environ["TRITON_SERVER_VERSION"], server_metadata["version"] + ) + self.assertEqual("triton", server_metadata["name"]) else: - self.assertEqual(os.environ["TRITON_SERVER_VERSION"], - server_metadata.version) + self.assertEqual( + os.environ["TRITON_SERVER_VERSION"], server_metadata.version + ) self.assertEqual("triton", server_metadata.name) model_metadata = triton_client.get_model_metadata(model_name) self.assertTrue(False, "expected unknown model failure") except InferenceServerException as ex: - 
self.assertTrue(ex.message().startswith( - "Request for unknown model: 'foo' is not found")) + self.assertTrue( + ex.message().startswith("Request for unknown model: 'foo' is not found") + ) def test_unknown_model_version(self): try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: model_name = "graphdef_int32_int8_int8" if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) model_metadata = triton_client.get_model_metadata( - model_name, model_version="99") + model_name, model_version="99" + ) self.assertTrue(False, "expected unknown model version failure") except InferenceServerException as ex: - self.assertTrue(ex.message().startswith( - "Request for unknown model: 'graphdef_int32_int8_int8' version 99 is not found" - )) + self.assertTrue( + ex.message().startswith( + "Request for unknown model: 'graphdef_int32_int8_int8' version 99 is not found" + ) + ) def test_model_latest_infer(self): input_size = 16 tensor_shape = (1, input_size) - platform_name = { - 'graphdef': 'tensorflow_graphdef', - 'onnx': 'onnxruntime_onnx' - } + platform_name = {"graphdef": "tensorflow_graphdef", "onnx": "onnxruntime_onnx"} # There are 3 versions of *_int32_int32_int32 and all # should be available. - for platform in ('graphdef', 'onnx'): + for platform in ("graphdef", "onnx"): model_name = platform + "_int32_int32_int32" # Initially there should be no version stats.. try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) - model_metadata = triton_client.get_model_metadata( - model_name) + model_metadata = triton_client.get_model_metadata(model_name) # verify all versions are reported when no model version is specified if pair[1] == "http": - self.assertEqual(model_name, model_metadata['name']) - self.assertEqual(len(model_metadata['versions']), 3) + self.assertEqual(model_name, model_metadata["name"]) + self.assertEqual(len(model_metadata["versions"]), 3) for v in (1, 2, 3): - self.assertIn(str(v), model_metadata['versions']) + self.assertIn(str(v), model_metadata["versions"]) else: self.assertEqual(model_name, model_metadata.name) self.assertEqual(len(model_metadata.versions), 3) @@ -176,9 +192,9 @@ def test_model_latest_infer(self): # verify contents of model metadata if pair[1] == "http": - model_platform = model_metadata['platform'] - model_inputs = model_metadata['inputs'] - model_outputs = model_metadata['outputs'] + model_platform = model_metadata["platform"] + model_inputs = model_metadata["inputs"] + model_outputs = model_metadata["outputs"] else: model_platform = model_metadata.platform model_inputs = model_metadata.inputs @@ -190,9 +206,9 @@ def test_model_latest_infer(self): for model_input in model_inputs: if pair[1] == "http": - input_dtype = model_input['datatype'] - 
input_shape = model_input['shape'] - input_name = model_input['name'] + input_dtype = model_input["datatype"] + input_shape = model_input["shape"] + input_name = model_input["name"] else: input_dtype = model_input.datatype input_shape = model_input.shape @@ -203,9 +219,9 @@ def test_model_latest_infer(self): for model_output in model_outputs: if pair[1] == "http": - output_dtype = model_output['datatype'] - output_shape = model_output['shape'] - output_name = model_output['name'] + output_dtype = model_output["datatype"] + output_shape = model_output["shape"] + output_name = model_output["name"] else: output_dtype = model_output.datatype output_shape = model_output.shape @@ -218,67 +234,79 @@ def test_model_latest_infer(self): self.assertTrue(False, "unexpected error {}".format(ex)) # Infer using latest version (which is 3)... - iu.infer_exact(self, - platform, - tensor_shape, - 1, - np.int32, - np.int32, - np.int32, - model_version=None, - swap=True) + iu.infer_exact( + self, + platform, + tensor_shape, + 1, + np.int32, + np.int32, + np.int32, + model_version=None, + swap=True, + ) try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) for v in (1, 2, 3): self.assertTrue( - triton_client.is_model_ready(model_name, - model_version=str(v))) + triton_client.is_model_ready( + model_name, model_version=str(v) + ) + ) # Only version 3 should have infer stats - infer_stats = triton_client.get_inference_statistics( - model_name) + infer_stats = triton_client.get_inference_statistics(model_name) if pair[1] == "http": - stats = infer_stats['model_stats'] + stats = infer_stats["model_stats"] else: stats = infer_stats.model_stats self.assertEqual( - len(stats), 3, - "expected 3 infer stats for model " + model_name) + len(stats), 3, "expected 3 infer stats for model " + model_name + ) for s in stats: if pair[1] == "http": - v = s['version'] - stat = s['inference_stats'] + v = s["version"] + stat = s["inference_stats"] else: v = s.version stat = s.inference_stats if v == "3": if pair[1] == "http": - self.assertTrue(stat['success']['count'], 3) + self.assertTrue(stat["success"]["count"], 3) else: self.assertTrue(stat.success.count, 3) else: if pair[1] == "http": self.assertEqual( - stat['success']['count'], 0, + stat["success"]["count"], + 0, "unexpected infer success counts for version " - + str(v) + " of model " + model_name) + + str(v) + + " of model " + + model_name, + ) else: self.assertEqual( - stat.success.count, 0, + stat.success.count, + 0, "unexpected infer success counts for version " - + str(v) + " of model " + model_name) + + str(v) + + " of model " + + model_name, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -288,136 +316,150 @@ def test_model_specific_infer(self): # There are 3 versions of *_float32_float32_float32 but only # versions 1 and 3 should be available. - for platform in ('graphdef', 'onnx', 'plan'): + for platform in ("graphdef", "onnx", "plan"): tensor_shape = (1, input_size) model_name = platform + "_float32_float32_float32" # Initially there should be no version status... 
try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) self.assertTrue( - triton_client.is_model_ready(model_name, - model_version="1")) + triton_client.is_model_ready(model_name, model_version="1") + ) self.assertFalse( - triton_client.is_model_ready(model_name, - model_version="2")) + triton_client.is_model_ready(model_name, model_version="2") + ) self.assertTrue( - triton_client.is_model_ready(model_name, - model_version="3")) + triton_client.is_model_ready(model_name, model_version="3") + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) # Infer using version 1... - iu.infer_exact(self, - platform, - tensor_shape, - 1, - np.float32, - np.float32, - np.float32, - model_version=1, - swap=False) + iu.infer_exact( + self, + platform, + tensor_shape, + 1, + np.float32, + np.float32, + np.float32, + model_version=1, + swap=False, + ) try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) self.assertTrue( - triton_client.is_model_ready(model_name, - model_version="1")) + triton_client.is_model_ready(model_name, model_version="1") + ) self.assertFalse( - triton_client.is_model_ready(model_name, - model_version="2")) + triton_client.is_model_ready(model_name, model_version="2") + ) self.assertTrue( - triton_client.is_model_ready(model_name, - model_version="3")) + triton_client.is_model_ready(model_name, model_version="3") + ) # Only version 1 should have infer stats infer_stats = triton_client.get_inference_statistics( - model_name, model_version='1') + model_name, model_version="1" + ) if pair[1] == "http": self.assertEqual( - len(infer_stats['model_stats']), 1, + len(infer_stats["model_stats"]), + 1, "expected 1 infer stats for version 1" - " of model " + model_name) - stats = infer_stats['model_stats'][0]['inference_stats'] - self.assertTrue(stats['success']['count'], 3) + " of model " + model_name, + ) + stats = infer_stats["model_stats"][0]["inference_stats"] + self.assertTrue(stats["success"]["count"], 3) else: self.assertEqual( - len(infer_stats.model_stats), 1, + len(infer_stats.model_stats), + 1, "expected 1 infer stats for version 1" - " of model " + model_name) + " of model " + model_name, + ) stats = infer_stats.model_stats[0].inference_stats self.assertTrue(stats.success.count, 3) infer_stats = triton_client.get_inference_statistics( - model_name, model_version='3') + model_name, model_version="3" + ) if pair[1] == "http": - stats = infer_stats['model_stats'][0]['inference_stats'] + stats = infer_stats["model_stats"][0]["inference_stats"] self.assertEqual( - stats['success']['count'], 0, + stats["success"]["count"], + 0, "unexpected infer stats for version 3" - " 
of model " + model_name) + " of model " + model_name, + ) else: stats = infer_stats.model_stats[0].inference_stats self.assertEqual( - stats.success.count, 0, + stats.success.count, + 0, "unexpected infer stats for version 3" - " of model " + model_name) + " of model " + model_name, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) class ModelMetadataTest(tu.TestResultCollector): - ''' + """ These tests must be run after the ServerMetadataTest. See test.sh file for correct test running. - ''' + """ def test_model_versions_deleted(self): # Originally There were 3 versions of *_int32_int32_int32 and # version 3 was executed once. Version 2 and 3 models were # deleted from the model repository so now only expect version 1 to # be ready and show stats. - for platform in ('graphdef', 'onnx'): + for platform in ("graphdef", "onnx"): model_name = platform + "_int32_int32_int32" try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) - model_metadata = triton_client.get_model_metadata( - model_name) + model_metadata = triton_client.get_model_metadata(model_name) if pair[1] == "http": - self.assertEqual(model_name, model_metadata['name']) - self.assertEqual(len(model_metadata['versions']), 1) - self.assertEqual("1", model_metadata['versions'][0]) + self.assertEqual(model_name, model_metadata["name"]) + self.assertEqual(len(model_metadata["versions"]), 1) + self.assertEqual("1", model_metadata["versions"][0]) else: self.assertEqual(model_name, model_metadata.name) self.assertEqual(len(model_metadata.versions), 1) @@ -428,30 +470,41 @@ def test_model_versions_deleted(self): if v == 1: self.assertTrue( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) infer_stats = triton_client.get_inference_statistics( - model_name, model_version=str(v)) + model_name, model_version=str(v) + ) if pair[1] == "http": self.assertEqual( - len(infer_stats['model_stats']), 1, - "expected 1 infer stats for version " + - str(v) + " of model " + model_name) - stats = infer_stats['model_stats'][0][ - 'inference_stats'] - self.assertEqual(stats['success']['count'], 0) + len(infer_stats["model_stats"]), + 1, + "expected 1 infer stats for version " + + str(v) + + " of model " + + model_name, + ) + stats = infer_stats["model_stats"][0]["inference_stats"] + self.assertEqual(stats["success"]["count"], 0) else: self.assertEqual( - len(infer_stats.model_stats), 1, - "expected 1 infer stats for version " + - str(v) + " of model " + model_name) - stats = infer_stats.model_stats[ - 0].inference_stats + len(infer_stats.model_stats), + 1, + "expected 1 infer stats for version " + + str(v) + + " of model " + + model_name, + ) + stats = infer_stats.model_stats[0].inference_stats self.assertEqual(stats.success.count, 0) else: self.assertFalse( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -460,40 +513,46 @@ def 
test_model_versions_added(self): # Originally There was version 1 of *_float16_float32_float32. # Version 7 was added so now expect just version 7 to be ready # and provide infer stats. - for platform in ('graphdef',): + for platform in ("graphdef",): model_name = platform + "_float16_float32_float32" try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) - model_metadata = triton_client.get_model_metadata( - model_name) + model_metadata = triton_client.get_model_metadata(model_name) if pair[1] == "http": self.assertEqual( - model_name, model_metadata['name'], - "expected status for model " + model_name) + model_name, + model_metadata["name"], + "expected status for model " + model_name, + ) self.assertEqual( - len(model_metadata['versions']), 1, - "expected status for 1 versions for model " + - model_name) - self.assertEqual("7", model_metadata['versions'][0]) + len(model_metadata["versions"]), + 1, + "expected status for 1 versions for model " + model_name, + ) + self.assertEqual("7", model_metadata["versions"][0]) else: self.assertEqual( - model_name, model_metadata.name, - "expected status for model " + model_name) + model_name, + model_metadata.name, + "expected status for model " + model_name, + ) self.assertEqual( - len(model_metadata.versions), 1, - "expected status for 1 versions for model " + - model_name) + len(model_metadata.versions), + 1, + "expected status for 1 versions for model " + model_name, + ) self.assertEqual("7", model_metadata.versions[0]) # Only version 7 should be ready and show infer stat. 
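The next hunk walks over the version numbers and cross-checks is_model_ready() against per-version inference statistics. Condensed into a standalone sketch (HTTP client shown, dict-style access; the gRPC client exposes the same fields as message attributes):

import tritonhttpclient as httpclient

# Sketch of the per-version readiness/statistics checks exercised below.
model_name = "graphdef_float16_float32_float32"
triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=True)

for v in range(1, 8):
    if triton_client.is_model_ready(model_name, model_version=str(v)):
        stats = triton_client.get_inference_statistics(model_name, model_version=str(v))
        count = stats["model_stats"][0]["inference_stats"]["success"]["count"]
        print("version", v, "success count =", count)
    else:
        # Asking for statistics of an unloaded version raises InferenceServerException
        # ("requested model version is not available for model ...").
        print("version", v, "is not ready")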
@@ -501,39 +560,52 @@ def test_model_versions_added(self): if v == 7: self.assertTrue( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) infer_stats = triton_client.get_inference_statistics( - model_name, model_version=str(v)) + model_name, model_version=str(v) + ) if pair[1] == "http": - stats = infer_stats['model_stats'][0][ - 'inference_stats'] + stats = infer_stats["model_stats"][0]["inference_stats"] self.assertEqual( - stats['success']['count'], 0, - "unexpected infer stats for version " + - str(v) + " of model " + model_name) + stats["success"]["count"], + 0, + "unexpected infer stats for version " + + str(v) + + " of model " + + model_name, + ) else: - stats = infer_stats.model_stats[ - 0].inference_stats + stats = infer_stats.model_stats[0].inference_stats self.assertEqual( - stats.success.count, 0, - "unexpected infer stats for version " + - str(v) + " of model " + model_name) + stats.success.count, + 0, + "unexpected infer stats for version " + + str(v) + + " of model " + + model_name, + ) else: self.assertFalse( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) try: infer_stats = triton_client.get_inference_statistics( - model_name, model_version=str(v)) + model_name, model_version=str(v) + ) self.assertTrue( False, - "unexpected infer stats for the model that is not ready" + "unexpected infer stats for the model that is not ready", ) except InferenceServerException as ex: self.assertIn( "requested model version is not available for model", - str(ex)) + str(ex), + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -543,27 +615,27 @@ def test_infer_stats_no_model_version(self): # version 3 was executed once. Version 2 and 3 models were # deleted from the model repository so now only expect version 1 to # be ready and show infer stats. 
- for platform in ('graphdef', 'onnx'): + for platform in ("graphdef", "onnx"): model_name = platform + "_int32_int32_int32" try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) - model_metadata = triton_client.get_model_metadata( - model_name) + model_metadata = triton_client.get_model_metadata(model_name) if pair[1] == "http": - self.assertEqual(model_name, model_metadata['name']) - self.assertEqual(len(model_metadata['versions']), 1) - self.assertEqual("1", model_metadata['versions'][0]) + self.assertEqual(model_name, model_metadata["name"]) + self.assertEqual(len(model_metadata["versions"]), 1) + self.assertEqual("1", model_metadata["versions"][0]) else: self.assertEqual(model_name, model_metadata.name) self.assertEqual(len(model_metadata.versions), 1) @@ -574,44 +646,55 @@ def test_infer_stats_no_model_version(self): if v == 1: self.assertTrue( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) else: self.assertFalse( triton_client.is_model_ready( - model_name, model_version=str(v))) + model_name, model_version=str(v) + ) + ) - infer_stats = triton_client.get_inference_statistics( - model_name) + infer_stats = triton_client.get_inference_statistics(model_name) if pair[1] == "http": - stats = infer_stats['model_stats'] + stats = infer_stats["model_stats"] else: stats = infer_stats.model_stats self.assertEqual( - len(stats), 1, - "expected 1 infer stats for model " + model_name) + len(stats), 1, "expected 1 infer stats for model " + model_name + ) if pair[1] == "http": - version = stats[0]['version'] - stat = stats[0]['inference_stats'] + version = stats[0]["version"] + stat = stats[0]["inference_stats"] else: version = stats[0].version stat = stats[0].inference_stats if version != "1": self.assertTrue( - False, - "expected version 1 for infer stat, got " + version) + False, "expected version 1 for infer stat, got " + version + ) else: if pair[1] == "http": self.assertEqual( - stat['success']['count'], 0, - "unexpected infer stats for version " + - str(version) + " of model " + model_name) + stat["success"]["count"], + 0, + "unexpected infer stats for version " + + str(version) + + " of model " + + model_name, + ) else: self.assertEqual( - stat.success.count, 0, - "unexpected infer stats for version " + - str(version) + " of model " + model_name) + stat.success.count, + 0, + "unexpected infer stats for version " + + str(version) + + " of model " + + model_name, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -619,14 +702,15 @@ def test_infer_stats_no_model_version(self): def test_infer_stats_no_model(self): # Test get_inference_statistics when no model/model_version is passed. 
try: - for pair in [("localhost:8000", "http"), - ("localhost:8001", "grpc")]: + for pair in [("localhost:8000", "http"), ("localhost:8001", "grpc")]: if pair[1] == "http": triton_client = httpclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) else: triton_client = grpcclient.InferenceServerClient( - url=pair[0], verbose=True) + url=pair[0], verbose=True + ) self.assertTrue(triton_client.is_server_live()) self.assertTrue(triton_client.is_server_ready()) @@ -634,17 +718,18 @@ def test_infer_stats_no_model(self): # Returns infer stats for ALL models + ready versions infer_stats = triton_client.get_inference_statistics() if pair[1] == "http": - stats = infer_stats['model_stats'] + stats = infer_stats["model_stats"] else: stats = infer_stats.model_stats self.assertEqual( - len(stats), 207, - "expected 207 infer stats for all ready versions of all model" + len(stats), + 219, + "expected 219 infer stats for all ready versions of all model", ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_shared_memory/shared_memory_test.py b/qa/L0_shared_memory/shared_memory_test.py old mode 100644 new mode 100755 index 867f6a85b8..6350dc2abe --- a/qa/L0_shared_memory/shared_memory_test.py +++ b/qa/L0_shared_memory/shared_memory_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,11 +27,13 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
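The shared memory tests that follow all revolve around the same system shared memory lifecycle: create a region, copy data into it, register it with the server, point an input (or output) at it, then unregister and destroy it. A condensed sketch of that lifecycle, assuming the shm alias used in this file refers to tritonclient.utils.shared_memory:

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm  # assumed module behind the "shm" alias

client = httpclient.InferenceServerClient("localhost:8000")

# 1. Create a 64-byte system shared memory region and copy input data into it.
handle = shm.create_shared_memory_region("input0_data", "/input0_data", 64)
shm.set_shared_memory_region(handle, [np.arange(16, dtype=np.int32)])

# 2. Register the region with the server and reference it from an input.
client.register_system_shared_memory("input0_data", "/input0_data", 64)
inp = httpclient.InferInput("INPUT0", [1, 16], "INT32")
inp.set_shared_memory("input0_data", 64)

# 3. (Run the inference here.) Afterwards, tear everything down again.
client.unregister_system_shared_memory("input0_data")
shm.destroy_shared_memory_region(handle)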
import sys + sys.path.append("../common") -import numpy as np -import unittest import os +import unittest + +import numpy as np import test_util as tu import tritonclient.grpc as grpcclient import tritonclient.http as httpclient @@ -38,12 +42,12 @@ class SharedMemoryTest(tu.TestResultCollector): - def test_invalid_create_shm(self): # Raises error since tried to create invalid system shared memory region try: shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", -1) + "dummy_data", "/dummy_data", -1 + ) shm.destroy_shared_memory_region(shm_op0_handle) except Exception as ex: self.assertTrue(str(ex) == "unable to initialize the size") @@ -54,12 +58,11 @@ def test_valid_create_set_register(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", 8) - shm.set_shared_memory_region(shm_op0_handle, - [np.array([1, 2], dtype=np.float32)]) - triton_client.register_system_shared_memory("dummy_data", "/dummy_data", - 8) + shm_op0_handle = shm.create_shared_memory_region("dummy_data", "/dummy_data", 8) + shm.set_shared_memory_region( + shm_op0_handle, [np.array([1, 2], dtype=np.float32)] + ) + triton_client.register_system_shared_memory("dummy_data", "/dummy_data", 8) shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": self.assertTrue(len(shm_status) == 1) @@ -73,8 +76,7 @@ def test_unregister_before_register(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", 8) + shm_op0_handle = shm.create_shared_memory_region("dummy_data", "/dummy_data", 8) triton_client.unregister_system_shared_memory("dummy_data") shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": @@ -89,10 +91,8 @@ def test_unregister_after_register(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", 8) - triton_client.register_system_shared_memory("dummy_data", "/dummy_data", - 8) + shm_op0_handle = shm.create_shared_memory_region("dummy_data", "/dummy_data", 8) + triton_client.register_system_shared_memory("dummy_data", "/dummy_data", 8) triton_client.unregister_system_shared_memory("dummy_data") shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": @@ -107,17 +107,14 @@ def test_reregister_after_register(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - shm_op0_handle = shm.create_shared_memory_region( - "dummy_data", "/dummy_data", 8) - triton_client.register_system_shared_memory("dummy_data", "/dummy_data", - 8) + shm_op0_handle = shm.create_shared_memory_region("dummy_data", "/dummy_data", 8) + triton_client.register_system_shared_memory("dummy_data", "/dummy_data", 8) try: - triton_client.register_system_shared_memory("dummy_data", - "/dummy_data", 8) + triton_client.register_system_shared_memory("dummy_data", "/dummy_data", 8) except Exception as ex: self.assertTrue( - "shared memory region 'dummy_data' already in manager" in str( - ex)) + "shared memory region 'dummy_data' already 
in manager" in str(ex) + ) shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": self.assertTrue(len(shm_status) == 1) @@ -127,13 +124,17 @@ def test_reregister_after_register(self): def _configure_sever(self): shm_ip0_handle = shm.create_shared_memory_region( - "input0_data", "/input0_data", 64) + "input0_data", "/input0_data", 64 + ) shm_ip1_handle = shm.create_shared_memory_region( - "input1_data", "/input1_data", 64) + "input1_data", "/input1_data", 64 + ) shm_op0_handle = shm.create_shared_memory_region( - "output0_data", "/output0_data", 64) + "output0_data", "/output0_data", 64 + ) shm_op1_handle = shm.create_shared_memory_region( - "output1_data", "/output1_data", 64) + "output1_data", "/output1_data", 64 + ) input0_data = np.arange(start=0, stop=16, dtype=np.int32) input1_data = np.ones(shape=16, dtype=np.int32) shm.set_shared_memory_region(shm_ip0_handle, [input0_data]) @@ -142,28 +143,26 @@ def _configure_sever(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - triton_client.register_system_shared_memory("input0_data", - "/input0_data", 64) - triton_client.register_system_shared_memory("input1_data", - "/input1_data", 64) - triton_client.register_system_shared_memory("output0_data", - "/output0_data", 64) - triton_client.register_system_shared_memory("output1_data", - "/output1_data", 64) + triton_client.register_system_shared_memory("input0_data", "/input0_data", 64) + triton_client.register_system_shared_memory("input1_data", "/input1_data", 64) + triton_client.register_system_shared_memory("output0_data", "/output0_data", 64) + triton_client.register_system_shared_memory("output1_data", "/output1_data", 64) return [shm_ip0_handle, shm_ip1_handle, shm_op0_handle, shm_op1_handle] def _cleanup_server(self, shm_handles): for shm_handle in shm_handles: shm.destroy_shared_memory_region(shm_handle) - def _basic_inference(self, - shm_ip0_handle, - shm_ip1_handle, - shm_op0_handle, - shm_op1_handle, - error_msg, - big_shm_name="", - big_shm_size=64): + def _basic_inference( + self, + shm_ip0_handle, + shm_ip1_handle, + shm_op0_handle, + shm_op1_handle, + error_msg, + big_shm_name="", + big_shm_size=64, + ): input0_data = np.arange(start=0, stop=16, dtype=np.int32) input1_data = np.ones(shape=16, dtype=np.int32) inputs = [] @@ -172,16 +171,16 @@ def _basic_inference(self, triton_client = httpclient.InferenceServerClient(_url, verbose=True) inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + httpclient.InferRequestedOutput("OUTPUT1", binary_data=False) + ) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) inputs.append(grpcclient.InferInput("INPUT0", [1, 16], "INT32")) inputs.append(grpcclient.InferInput("INPUT1", [1, 16], "INT32")) - outputs.append(grpcclient.InferRequestedOutput('OUTPUT0')) - outputs.append(grpcclient.InferRequestedOutput('OUTPUT1')) + outputs.append(grpcclient.InferRequestedOutput("OUTPUT0")) + outputs.append(grpcclient.InferRequestedOutput("OUTPUT1")) inputs[0].set_shared_memory("input0_data", 64) @@ -196,23 +195,24 @@ def _basic_inference(self, outputs[1].set_shared_memory("output1_data", 64) 
try: - results = triton_client.infer("simple", - inputs, - model_version="", - outputs=outputs) - output = results.get_output('OUTPUT0') + results = triton_client.infer( + "simple", inputs, model_version="", outputs=outputs + ) + output = results.get_output("OUTPUT0") if _protocol == "http": - output_datatype = output['datatype'] - output_shape = output['shape'] + output_datatype = output["datatype"] + output_shape = output["shape"] else: output_datatype = output.datatype output_shape = output.shape output_dtype = utils.triton_to_np_dtype(output_datatype) - output_data = shm.get_contents_as_numpy(shm_op0_handle, - output_dtype, output_shape) + output_data = shm.get_contents_as_numpy( + shm_op0_handle, output_dtype, output_shape + ) self.assertTrue( (output_data[0] == (input0_data + input1_data)).all(), - "Model output does not match expected output") + "Model output does not match expected output", + ) except Exception as ex: error_msg.append(str(ex)) @@ -220,8 +220,9 @@ def test_unregister_after_inference(self): # Unregister after inference error_msg = [] shm_handles = self._configure_sever() - self._basic_inference(shm_handles[0], shm_handles[1], shm_handles[2], - shm_handles[3], error_msg) + self._basic_inference( + shm_handles[0], shm_handles[1], shm_handles[2], shm_handles[3], error_msg + ) if len(error_msg) > 0: raise Exception(str(error_msg)) if _protocol == "http": @@ -244,14 +245,15 @@ def test_register_after_inference(self): triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - self._basic_inference(shm_handles[0], shm_handles[1], shm_handles[2], - shm_handles[3], error_msg) + self._basic_inference( + shm_handles[0], shm_handles[1], shm_handles[2], shm_handles[3], error_msg + ) if len(error_msg) > 0: raise Exception(str(error_msg)) shm_ip2_handle = shm.create_shared_memory_region( - "input2_data", "/input2_data", 64) - triton_client.register_system_shared_memory("input2_data", - "/input2_data", 64) + "input2_data", "/input2_data", 64 + ) + triton_client.register_system_shared_memory("input2_data", "/input2_data", 64) shm_status = triton_client.get_system_shared_memory_status() if _protocol == "http": self.assertTrue(len(shm_status) == 5) @@ -265,19 +267,27 @@ def test_too_big_shm(self): error_msg = [] shm_handles = self._configure_sever() shm_ip2_handle = shm.create_shared_memory_region( - "input2_data", "/input2_data", 128) + "input2_data", "/input2_data", 128 + ) if _protocol == "http": triton_client = httpclient.InferenceServerClient(_url, verbose=True) else: triton_client = grpcclient.InferenceServerClient(_url, verbose=True) - triton_client.register_system_shared_memory("input2_data", - "/input2_data", 128) - self._basic_inference(shm_handles[0], shm_ip2_handle, shm_handles[2], - shm_handles[3], error_msg, "input2_data", 128) + triton_client.register_system_shared_memory("input2_data", "/input2_data", 128) + self._basic_inference( + shm_handles[0], + shm_ip2_handle, + shm_handles[2], + shm_handles[3], + error_msg, + "input2_data", + 128, + ) if len(error_msg) > 0: self.assertTrue( "unexpected total byte size 128 for input 'INPUT1', expecting 64" - in error_msg[-1]) + in error_msg[-1] + ) shm_handles.append(shm_ip2_handle) self._cleanup_server(shm_handles) @@ -286,8 +296,9 @@ def test_mixed_raw_shm(self): error_msg = [] shm_handles = self._configure_sever() input1_data = np.ones(shape=16, dtype=np.int32) - self._basic_inference(shm_handles[0], [input1_data], shm_handles[2], - 
shm_handles[3], error_msg) + self._basic_inference( + shm_handles[0], [input1_data], shm_handles[2], shm_handles[3], error_msg + ) if len(error_msg) > 0: raise Exception(error_msg[-1]) self._cleanup_server(shm_handles) @@ -313,8 +324,8 @@ def test_unregisterall(self): self._cleanup_server(shm_handles) -if __name__ == '__main__': - _protocol = os.environ.get('CLIENT_TYPE', "http") +if __name__ == "__main__": + _protocol = os.environ.get("CLIENT_TYPE", "http") if _protocol == "http": _url = "localhost:8000" else: diff --git a/qa/L0_shared_memory/test.sh b/qa/L0_shared_memory/test.sh old mode 100644 new mode 100755 index b510688740..e30a7dffa7 --- a/qa/L0_shared_memory/test.sh +++ b/qa/L0_shared_memory/test.sh @@ -52,7 +52,7 @@ for i in \ test_unregisterall; do for client_type in http grpc; do SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1 ${SERVER_ARGS_EXTRA}" - SERVER_LOG="./$i.$client_type.serverlog" + SERVER_LOG="./$i.$client_type.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_simple_ensemble/ensemble_test.py b/qa/L0_simple_ensemble/ensemble_test.py old mode 100644 new mode 100755 index b91097dfee..0b064c13e8 --- a/qa/L0_simple_ensemble/ensemble_test.py +++ b/qa/L0_simple_ensemble/ensemble_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,77 +27,82 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") sys.path.append("../clients") import logging - -import os import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu import tritonhttpclient class EnsembleTest(tu.TestResultCollector): - def _get_infer_count_per_version(self, model_name): - triton_client = tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True) + triton_client = tritonhttpclient.InferenceServerClient( + "localhost:8000", verbose=True + ) stats = triton_client.get_inference_statistics(model_name) self.assertEqual(len(stats["model_stats"]), 2) infer_count = [0, 0] for model_stat in stats["model_stats"]: - self.assertEqual(model_stat["name"], model_name, - "expected stats for model " + model_name) - model_version = model_stat['version'] + self.assertEqual( + model_stat["name"], model_name, "expected stats for model " + model_name + ) + model_version = model_stat["version"] if model_version == "1": - infer_count[0] = model_stat["inference_stats"]["success"][ - "count"] + infer_count[0] = model_stat["inference_stats"]["success"]["count"] elif model_version == "2": - infer_count[1] = model_stat["inference_stats"]["success"][ - "count"] + infer_count[1] = model_stat["inference_stats"]["success"]["count"] else: self.assertTrue( - False, "unexpected version {} for model {}".format( - model_version, model_name)) + False, + "unexpected version {} for model {}".format( + model_version, model_name + ), + ) return infer_count def test_ensemble_add_sub(self): for bs in (1, 8): - iu.infer_exact(self, "ensemble_add_sub", (bs, 16), bs, np.int32, - np.int32, np.int32) + iu.infer_exact( + self, "ensemble_add_sub", (bs, 16), bs, np.int32, np.int32, np.int32 + ) infer_count = self._get_infer_count_per_version("simple") # The two 'simple' versions 
should have the same infer count - if (infer_count[0] != infer_count[1]): + if infer_count[0] != infer_count[1]: self.assertTrue( - False, - "unexpeced different infer count for different 'simple' versions" + False, "unexpeced different infer count for different 'simple' versions" ) def test_ensemble_add_sub_one_output(self): for bs in (1, 8): - iu.infer_exact(self, - "ensemble_add_sub", (bs, 16), - bs, - np.int32, - np.int32, - np.int32, - outputs=("OUTPUT0",)) + iu.infer_exact( + self, + "ensemble_add_sub", + (bs, 16), + bs, + np.int32, + np.int32, + np.int32, + outputs=("OUTPUT0",), + ) infer_count = self._get_infer_count_per_version("simple") # Only 'simple' version 2 should have non-zero infer count # as it is in charge of producing OUTPUT0 - if (infer_count[0] != 0): - self.assertTrue( - False, "unexpeced non-zero infer count for 'simple' version 1") - elif (infer_count[1] == 0): + if infer_count[0] != 0: self.assertTrue( - False, "unexpeced zero infer count for 'simple' version 2") + False, "unexpeced non-zero infer count for 'simple' version 1" + ) + elif infer_count[1] == 0: + self.assertTrue(False, "unexpeced zero infer count for 'simple' version 2") -if __name__ == '__main__': +if __name__ == "__main__": logging.basicConfig(stream=sys.stderr) unittest.main() diff --git a/qa/L0_simple_go_client/test.sh b/qa/L0_simple_go_client/test.sh old mode 100644 new mode 100755 index fcf7ed41b5..f09b79bfa2 --- a/qa/L0_simple_go_client/test.sh +++ b/qa/L0_simple_go_client/test.sh @@ -29,7 +29,8 @@ export CUDA_VISIBLE_DEVICES=0 TRITON_COMMON_REPO_TAG=${TRITON_COMMON_REPO_TAG:="main"} -SIMPLE_GO_CLIENT=grpc_simple_client.go +GO_CLIENT_DIR=client/src/grpc_generated/go +SIMPLE_GO_CLIENT=${GO_CLIENT_DIR}/grpc_simple_client.go SERVER=/opt/tritonserver/bin/tritonserver SERVER_ARGS=--model-repository=`pwd`/models @@ -47,28 +48,26 @@ fi RET=0 -# Fix to allow global stubs import -sed -i 's/.\/nvidia_inferenceserver/nvidia_inferenceserver/g' $SIMPLE_GO_CLIENT +# Generate Go stubs. +rm -fr client common +git clone https://github.com/triton-inference-server/client.git +go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest -PACKAGE_PATH="${GOPATH}/src" -mkdir -p ${PACKAGE_PATH} - -# Get the proto files from the common repo -rm -fr common +pushd ${GO_CLIENT_DIR} git clone --single-branch --depth=1 -b $TRITON_COMMON_REPO_TAG \ https://github.com/triton-inference-server/common.git -mkdir core && cp common/protobuf/*.proto core/. +bash gen_go_stubs.sh +popd -# Requires protoc and protoc-gen-go plugin: https://github.com/golang/protobuf#installation -# Use "M" arguments since go_package is not specified in .proto files. -# As mentioned here: https://developers.google.com/protocol-buffers/docs/reference/go-generated#package -GO_PACKAGE="nvidia_inferenceserver" -protoc -I core --go_out=plugins=grpc:${PACKAGE_PATH} --go_opt=Mgrpc_service.proto=./${GO_PACKAGE} \ - --go_opt=Mmodel_config.proto=./${GO_PACKAGE} core/*.proto +# Copy packages to GOPATH, where Go expects to find packages. +PACKAGE_PATH="${GOPATH}/src/github.com/triton-inference-server" +rm -rf ${PACKAGE_PATH}/client +mkdir -p ${PACKAGE_PATH} +cp -r client $PACKAGE_PATH set +e -# Runs test for GRPC variant of go client +# Run test for GRPC variant of go client GO111MODULE=off go run $SIMPLE_GO_CLIENT >>client.log 2>&1 if [ $? 
-ne 0 ]; then RET=1 diff --git a/qa/L0_simple_nodejs_client/test.sh b/qa/L0_simple_nodejs_client/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_socket/test.sh b/qa/L0_socket/test.sh old mode 100644 new mode 100755 index 257976ce96..2fd37bd054 --- a/qa/L0_socket/test.sh +++ b/qa/L0_socket/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -35,7 +35,7 @@ SERVER=/opt/tritonserver/bin/tritonserver SERVER_TIMEOUT=15 source ../common/util.sh -rm -f $CLIENT_LOG $SERVER_LOG +rm -f *.log RET=0 @@ -46,8 +46,8 @@ for address in default explicit; do SAME_EXPLICIT_ADDRESS="" DIFF_EXPLICIT_ADDRESS_ARGS="" else - SAME_EXPLICIT_ADDRESS="--http-address 127.0.0.1 --grpc-address 127.0.0.1" - DIFF_EXPLICIT_ADDRESS="--http-address 127.0.0.1 --grpc-address 127.0.0.2" + SAME_EXPLICIT_ADDRESS="--http-address 127.0.0.1 --grpc-address 127.0.0.1 --metrics-address 127.0.0.1" + DIFF_EXPLICIT_ADDRESS="--http-address 127.0.0.1 --grpc-address 127.0.0.2 --metrics-address 127.0.0.3" fi for p in http grpc; do @@ -138,7 +138,7 @@ for address in default explicit; do kill $SERVER_PID wait $SERVER_PID - # error if http/grpc port overlaps with grpc/http explicit port + # error if http/grpc port overlaps with grpc/http explicit port if [ "$p" == "http" ]; then SERVER_ARGS="--model-repository=$DATADIR $SAME_EXPLICIT_ADDRESS --http-port 8003 --grpc-port 8003" run_server_nowait @@ -302,6 +302,112 @@ for address in default explicit; do done done +# Test multiple servers binding to the same http/grpc port +SERVER0_LOG="./inference_server0.log" +SERVER1_LOG="./inference_server1.log" +SERVER2_LOG="./inference_server2.log" + +for p in http grpc; do + # error if servers bind to the same http/grpc port without setting the reuse flag + if [ "$p" == "http" ]; then + SERVER_ARGS="--model-repository=$DATADIR --metrics-port 8002 --reuse-grpc-port=true" + SERVER0_ARGS="--model-repository=$DATADIR --metrics-port 8003 --reuse-grpc-port=true" + SERVER1_ARGS="--model-repository=$DATADIR --metrics-port 8004 --reuse-grpc-port=true" + else + SERVER_ARGS="--model-repository=$DATADIR --metrics-port 8002 --reuse-http-port=true" + SERVER0_ARGS="--model-repository=$DATADIR --metrics-port 8003 --reuse-http-port=true" + SERVER1_ARGS="--model-repository=$DATADIR --metrics-port 8004 --reuse-http-port=true" + fi + # make sure the first server is launched successfully, then run the other + # two servers and expect them to fail + run_server + run_multiple_servers_nowait 2 + sleep 15 + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start SERVER $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + if [ "$SERVER1_PID" != "0" ]; then + set +e + kill $SERVER0_PID + wait $SERVER0_PID + if [ "$?" == "0" ]; then + echo -e "\n***\n*** unexpected start SERVER0 $SERVER\n***" + cat $SERVER0_LOG + exit 1 + fi + set -e + fi + if [ "$SERVER1_PID" != "0" ]; then + set +e + kill $SERVER1_PID + wait $SERVER1_PID + if [ "$?" == "0" ]; then + echo -e "\n***\n*** unexpected start SERVER1 $SERVER\n***" + cat $SERVER1_LOG + exit 1 + fi + set -e + fi + kill_server + + # 1. Allow multiple servers bind to the same http/grpc port with setting the reuse flag + # 2. 
Test different forms of setting --metrics-address and verify metrics are queryable + # (a) Test default metrics-address being same as http-address + # (b) Test setting metrics-address explicitly to 0.0.0.0 + # (c) Test setting metrics-address explicitly to 127.0.0.1 + SERVER0_ARGS="--model-repository=$DATADIR --metrics-port 8002 --reuse-http-port=true --reuse-grpc-port=true" + SERVER1_ARGS="--model-repository=$DATADIR --metrics-address 0.0.0.0 --metrics-port 8003 --reuse-http-port=true --reuse-grpc-port=true" + SERVER2_ARGS="--model-repository=$DATADIR --metrics-address 127.0.0.2 --metrics-port 8004 --reuse-http-port=true --reuse-grpc-port=true" + run_multiple_servers_nowait 3 + sleep 15 + if [ "$SERVER0_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start SERVER0 $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + if [ "$SERVER1_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start SERVER1 $SERVER\n***" + cat $SERVER1_LOG + exit 1 + fi + if [ "$SERVER2_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start SERVER2 $SERVER\n***" + cat $SERVER2_LOG + exit 1 + fi + + set +e + + # test if requests are being distributed among three servers + if [ "$p" == "http" ]; then + CLIENT_PY=../clients/simple_http_infer_client.py + else + CLIENT_PY=../clients/simple_grpc_infer_client.py + fi + + pids=() + for i in {0..10}; do + python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 & + pids+=" $!" + done + wait $pids || { echo -e "\n***\n*** Python ${p} Async Infer Test Failed\n***"; cat $CLIENT_LOG; RET=1; } + + set -e + + server0_request_count=`curl -s localhost:8002/metrics | awk '/nv_inference_request_success{/ {print $2}'` + server1_request_count=`curl -s localhost:8003/metrics | awk '/nv_inference_request_success{/ {print $2}'` + server2_request_count=`curl -s 127.0.0.2:8004/metrics | awk '/nv_inference_request_success{/ {print $2}'` + if [ ${server0_request_count%.*} -eq 0 ] || \ + [ ${server1_request_count%.*} -eq 0 ] || \ + [ ${server2_request_count%.*} -eq 0 ]; then + echo -e "\n***\n*** Failed: ${p} requests are not distributed among all servers.\n***" + RET=1 + fi + kill_servers +done + if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else diff --git a/qa/L0_storage_S3/infer_test.py b/qa/L0_storage_S3/infer_test.py deleted file mode 100644 index 9933809b6d..0000000000 --- a/qa/L0_storage_S3/infer_test.py +++ /dev/null @@ -1,174 +0,0 @@ -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY -# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY -# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -import sys -sys.path.append("../common") - -from builtins import range -from future.utils import iteritems -import unittest -import numpy as np -import infer_util as iu -import test_util as tu -import os - -np_dtype_string = np.dtype(object) - - -class InferTest(tu.TestResultCollector): - - def _full_exact(self, input_dtype, output0_dtype, output1_dtype, - output0_raw, output1_raw, swap): - - def _infer_exact_helper(tester, - pf, - tensor_shape, - batch_size, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=True, - output1_raw=True, - model_version=None, - swap=False, - outputs=("OUTPUT0", "OUTPUT1"), - use_http=True, - use_grpc=True, - skip_request_id_check=False, - use_streaming=True, - correlation_id=0): - for bs in (1, batch_size): - iu.infer_exact(tester, - pf, (bs,) + tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) - - input_size = 16 - - if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - for pf in ["graphdef", "savedmodel"]: - _infer_exact_helper(self, - pf, (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_trt_model(input_dtype, output0_dtype, output1_dtype, - (input_size, 1, 1), (input_size, 1, 1), - (input_size, 1, 1)): - if input_dtype == np.int8: - _infer_exact_helper(self, - 'plan', (input_size, 1, 1), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - else: - _infer_exact_helper(self, - 'plan', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_onnx_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - _infer_exact_helper(self, - 'onnx', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - # Skip for batched string I/O - if tu.validate_for_libtorch_model(input_dtype, output0_dtype, - output1_dtype, (input_size,), - (input_size,), (input_size,), 8): - _infer_exact_helper(self, - 'libtorch', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - def test_raw_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=True) - - def test_raw_ooo(self): - self._full_exact(np_dtype_string, - np_dtype_string, - np_dtype_string, - output0_raw=True, - output1_raw=True, - swap=False) - - def test_class_fff(self): - 
self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=False, - output1_raw=False, - swap=True) - - -if __name__ == '__main__': - unittest.main() diff --git a/qa/L0_storage_S3/test.sh b/qa/L0_storage_S3/test.sh index 5fe4315dd5..51c8b2ce1e 100755 --- a/qa/L0_storage_S3/test.sh +++ b/qa/L0_storage_S3/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2018-2021, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2018-2023, NVIDIA CORPORATION. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,7 +42,7 @@ fi export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG_BASE="./client" -INFER_TEST=infer_test.py +INFER_TEST="../common/infer_test.py" EXPECTED_NUM_TESTS="3" TEST_RESULT_FILE='test_results.txt' @@ -65,6 +65,11 @@ aws s3 mb "${BUCKET_URL}" BUCKET_URL=${BUCKET_URL%/} BUCKET_URL_SLASH="${BUCKET_URL}/" +# Backup S3 credentials as they will be unset during the test +AWS_DEFAULT_REGION_BACKUP=$AWS_DEFAULT_REGION +AWS_ACCESS_KEY_ID_BACKUP=$AWS_ACCESS_KEY_ID +AWS_SECRET_ACCESS_KEY_BACKUP=$AWS_SECRET_ACCESS_KEY + SERVER=/opt/tritonserver/bin/tritonserver SERVER_TIMEOUT=420 @@ -77,7 +82,7 @@ RET=0 # Test 3 Scenarios: # 1. Only AWS ENV vars (Without aws configure) # 2. AWS ENV vars + dummy values in aws configure [ENV vars have higher priority] -# 3. Only aws configure (Without AWS ENV vars) +# 3. Only AWS configured (Without AWS ENV vars) for ENV_VAR in "env" "env_dummy" "config"; do SERVER_LOG=$SERVER_LOG_BASE.$ENV_VAR.log CLIENT_LOG=$CLIENT_LOG_BASE.$ENV_VAR.log @@ -242,6 +247,15 @@ for ENV_VAR in "env" "env_dummy" "config"; do done done +# Restore S3 credentials +rm ~/.aws/credentials && rm ~/.aws/config +export AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION_BACKUP +export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID_BACKUP +export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY_BACKUP +aws configure set default.region $AWS_DEFAULT_REGION && \ + aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID && \ + aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY + # Test with polling enabled SERVER_ARGS="--model-repository=$ROOT_REPO --exit-timeout-secs=120 --model-control-mode=poll" @@ -278,6 +292,48 @@ set -e kill $SERVER_PID wait $SERVER_PID +# Test localization to a specified location +export TRITON_AWS_MOUNT_DIRECTORY=`pwd`/aws_localization_test + +if [ -d "$TRITON_AWS_MOUNT_DIRECTORY" ]; then + rm -rf $TRITON_AWS_MOUNT_DIRECTORY +fi + +mkdir -p $TRITON_AWS_MOUNT_DIRECTORY + +SERVER_LOG=$SERVER_LOG_BASE.custom_localization.log +SERVER_ARGS="--model-repository=$ROOT_REPO --exit-timeout-secs=120" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +if [ -z "$(ls -A $TRITON_AWS_MOUNT_DIRECTORY)" ]; then + echo -e "\n***\n*** Test localization to a specified location failed. \n***" + echo -e "\n***\n*** Specified mount folder $TRITON_AWS_MOUNT_DIRECTORY is empty \n***" + ls -A $TRITON_AWS_MOUNT_DIRECTORY + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +if [ -d "$TRITON_AWS_MOUNT_DIRECTORY" ] && [ ! -z "$(ls -A $TRITON_AWS_MOUNT_DIRECTORY)" ]; then + echo -e "\n***\n*** Test localization to a specified location failed. \n***" + echo -e "\n***\n*** Specified mount folder $TRITON_AWS_MOUNT_DIRECTORY was not cleared properly. 
\n***" + ls -A $TRITON_AWS_MOUNT_DIRECTORY + exit 1 +fi + +rm -rf $TRITON_AWS_MOUNT_DIRECTORY +unset TRITON_AWS_MOUNT_DIRECTORY + +# Save models for AWS_SESSION_TOKEN test +rm -rf tmp_cred_test_models +mv models tmp_cred_test_models # Clean up bucket contents aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" @@ -323,6 +379,143 @@ fi kill $SERVER_PID wait $SERVER_PID +# Clean up bucket contents +aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" + +# Test with temporary credential (AWS_SESSION_TOKEN) +AWS_GET_SESSION_TOKEN_RES=`aws sts get-session-token --duration-seconds 900` && \ + export AWS_ACCESS_KEY_ID=`echo $AWS_GET_SESSION_TOKEN_RES | jq -r ".Credentials.AccessKeyId"` && \ + export AWS_SECRET_ACCESS_KEY=`echo $AWS_GET_SESSION_TOKEN_RES | jq -r ".Credentials.SecretAccessKey"` && \ + export AWS_SESSION_TOKEN=`echo $AWS_GET_SESSION_TOKEN_RES | jq -r ".Credentials.SessionToken"` +rm ~/.aws/credentials && rm ~/.aws/config +aws configure set default.region $AWS_DEFAULT_REGION && \ + aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID && \ + aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY && \ + aws configure set aws_session_token $AWS_SESSION_TOKEN + +# Copy models into S3 bucket +aws s3 cp tmp_cred_test_models/ "${BUCKET_URL_SLASH}" --recursive --include "*" + +SERVER_LOG=$SERVER_LOG_BASE.temporary_credentials_test.log +SERVER_ARGS="--model-repository=$BUCKET_URL --exit-timeout-secs=120" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e + +python $INFER_TEST >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Test access decline +export AWS_SECRET_ACCESS_KEY="[Invalid]" && export AWS_SESSION_TOKEN="" +SERVER_LOG=$SERVER_LOG_BASE.access_decline_test.log +SERVER_ARGS="--model-repository=$BUCKET_URL --exit-timeout-secs=120" +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Unexpected server start $SERVER\n***" + cat $SERVER_LOG + kill $SERVER_PID + wait $SERVER_PID + RET=1 +else + # AWS S3 does not appear to reply on access decline, but other implementations + # might provide extra messages, so make sure Triton will print the messages. + EXPECTED_MSG="Unable to create S3 filesystem client. Check account credentials. Exception: '' Message: 'No response body.'" + if ! 
grep "$EXPECTED_MSG" $SERVER_LOG; then + echo -e "\n***\n*** Expected error message not found\n***" + cat $SERVER_LOG + RET=1 + fi +fi + +# Restore S3 credentials +rm ~/.aws/credentials && rm ~/.aws/config +export AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION_BACKUP +export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID_BACKUP +export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY_BACKUP +aws configure set default.region $AWS_DEFAULT_REGION && \ + aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID && \ + aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY + +# Clean up bucket contents +aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" + +# Test case where S3 folder has >1000 files +rm -rf models + +mkdir -p models/model/1 +# Create Python model that reads the number of files in the +# model directory when loaded +echo "import os + +class TritonPythonModel: + + def initialize(self, args): + count = 0 + model_dir = args['model_repository'] + for path in os.listdir(model_dir): + if os.path.isfile(os.path.join(model_dir, path)): + count += 1 + print('Found {} files in model directory'.format(count)) + + def execute(self): + pass" > models/model/1/model.py + +for i in {1..1050}; do + touch models/model/0${i}.txt +done + +# Provide extended timeout to allow >1000 files to be loaded +SERVER_ARGS="--model-repository=$BUCKET_URL --exit-timeout-secs=600 --model-control-mode=none" +SERVER_LOG=$SERVER_LOG_BASE.many_files.log + +# copy contents of /models into S3 bucket and wait for them to be loaded. +aws s3 cp models/ "${BUCKET_URL_SLASH}" --recursive --include "*" + +# Test that the server starts up. Files will be loaded in numerically +# ascending order, so the model file is loaded after the first 1000 +# files. If AWS fails to load >1000 files, the model file will not +# be loaded and the server will fail to start. + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +# Confirm the correct number of files loaded +EXPECTED_MSG="Found 1050 files in model directory" +if ! grep "$EXPECTED_MSG" $SERVER_LOG; then +echo -e "\n***\n*** Expected file count message not found\n***" +cat $SERVER_LOG +RET=1 +fi + # Clean up bucket contents and delete bucket aws s3 rm "${BUCKET_URL_SLASH}" --recursive --include "*" aws s3 rb "${BUCKET_URL}" diff --git a/qa/L0_storage_S3_local/mock_s3_service.py b/qa/L0_storage_S3_local/mock_s3_service.py new file mode 100755 index 0000000000..956aac0e66 --- /dev/null +++ b/qa/L0_storage_S3_local/mock_s3_service.py @@ -0,0 +1,113 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import threading +import time +from http.server import BaseHTTPRequestHandler, HTTPServer + + +class MockS3Service: + __address = "localhost" + __port = 8080 + + def __init__(self): + # Test passed when: + # - at least one HEAD request is received; and + # - at least one GET request is received; and + # - all received requests do not advertise for HTTP/2. + test_results = {"head_count": 0, "get_count": 0, "http2_ads": False} + + class RequestValidator(BaseHTTPRequestHandler): + protocol_version = "HTTP/1.1" + + def __CheckHttp2Ads(self): + if "connection" in self.headers: + v = self.headers["connection"].lower() + if "upgrade" in v or "http2" in v: + test_results["http2_ads"] = True + if ( + "upgrade" in self.headers + and "h2c" in self.headers["upgrade"].lower() + ): + test_results["http2_ads"] = True + if "http2-settings" in self.headers: + test_results["http2_ads"] = True + + def do_HEAD(self): + self.__CheckHttp2Ads() + test_results["head_count"] += 1 + self.send_response(200) + self.end_headers() + + def do_GET(self): + self.__CheckHttp2Ads() + test_results["get_count"] += 1 + self.send_error( + 404, + "Thank you for using the mock s3 service!", + "Your bucket is not found here!", + ) + + self.__test_results = test_results + self.__server = HTTPServer((self.__address, self.__port), RequestValidator) + self.__service_thread = threading.Thread(target=self.__server.serve_forever) + + def __enter__(self): + self.__service_thread.start() + + def __exit__(self, exc_type, exc_val, exc_tb): + self.__server.shutdown() + self.__server.server_close() + self.__service_thread.join() + + def TestPassed(self): + return ( + self.__test_results["head_count"] > 0 + and self.__test_results["get_count"] > 0 + and not self.__test_results["http2_ads"] + ) + + +if __name__ == "__main__": + # Initialize mock service + mock_s3_service = MockS3Service() + + # Start service and poll until test passed or timed-out + with mock_s3_service: + poll_interval = 1 # seconds + timeout = 10 # seconds + elapsed_time = 0 # seconds + while not mock_s3_service.TestPassed() and elapsed_time < timeout: + elapsed_time += poll_interval + time.sleep(poll_interval) + + # Print the result + if mock_s3_service.TestPassed(): + print("TEST PASSED") + else: + print("TEST FAILED") diff --git a/qa/L0_s3_local/test.sh b/qa/L0_storage_S3_local/test.sh old mode 100644 new mode 100755 similarity index 64% rename from qa/L0_s3_local/test.sh rename to qa/L0_storage_S3_local/test.sh index eee5495971..e60b106b31 --- a/qa/L0_s3_local/test.sh +++ b/qa/L0_storage_S3_local/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -41,17 +41,63 @@ fi export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG="./client.log" -PERF_CLIENT=../clients/perf_client +TEST_RESULT_FILE='test_results.txt' +INFER_TEST="../common/infer_test.py" +EXPECTED_NUM_TESTS="3" DATADIR="/data/inferenceserver/${REPO_VERSION}/qa_model_repository" -BACKENDS="graphdef libtorch onnx plan savedmodel" +# Used to control which backends are run in infer_test.py +BACKENDS=${BACKENDS:="graphdef savedmodel onnx libtorch plan"} -rm -rf models && mkdir models -for BACKEND in $BACKENDS; do - cp -r $DATADIR/${BACKEND}_float32_float32_float32 models/. - # Remove version policy from config.pbtxt - sed -i '/^version_policy/d' models/${BACKEND}_float32_float32_float32/config.pbtxt -done +function run_unit_tests() { + echo "Running unit tests: ${INFER_TEST}" + python $INFER_TEST >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi +} + +function setup_model_repo() { + model_repo=${1:-"models"} + backends=${2:-${BACKENDS}} + types=${3:-"float32_float32_float32 object_object_object"} + echo "[setup_model_repo] model_repo: ${model_repo}, backends: ${backends}" + rm -rf ${model_repo} && mkdir ${model_repo} + for BACKEND in ${backends}; do + for TYPE in ${types}; do + model="${BACKEND}_${TYPE}" + echo "Copying ${DATADIR}/${model} to ${model_repo}." + cp -r "${DATADIR}/${model}" "${model_repo}/" + # Remove version policy from config.pbtxt + sed -i '/^version_policy/d' ${model_repo}/${model}/config.pbtxt + done + done +} + +function load_models() { + model_repo=${1:-"models"} + for model in `ls ${model_repo}`; do + echo "Loading model: ${model}" + code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${model}/load` + if [ "$code" != "200" ]; then + echo -e "\n***\n*** Test Failed. Failed to load model: ${model}\n***" + RET=1 + fi + done +} + +set +e +setup_model_repo +set -e # Create model with name that has all types of allowed characters DUMMY_MODEL="Model_repo-1.0" @@ -75,7 +121,7 @@ export MINIO_ACCESS_KEY="minio" # https://github.com/minio/minio/issues/15030 export MINIO_CI_CD=true MINIO_VOLUMES="/usr/local/share/minio/" -MINIO_OPTS="-C /etc/minio --address localhost:4572" +MINIO_OPTS="-C /etc/minio --address 127.0.0.1:4572" export MINIO_SECRET_KEY="miniostorage" (curl -O https://raw.githubusercontent.com/minio/minio-service/master/linux-systemd/minio.service && \ @@ -105,6 +151,7 @@ awslocal $ENDPOINT_FLAG s3 mb s3://demo-bucket1.0 && \ RET=0 # Test with hostname and IP address +echo "=== Running hostname/IP tests ===" for HOST in "127.0.0.1" "localhost"; do SERVER_ARGS="--model-repository=s3://$HOST:4572/demo-bucket1.0 --model-control-mode=explicit" if [ "$HOST" = "127.0.0.1" ]; then @@ -124,20 +171,8 @@ for HOST in "127.0.0.1" "localhost"; do fi set +e - for BACKEND in $BACKENDS; do - code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${BACKEND}_float32_float32_float32/load` - if [ "$code" != "200" ]; then - echo -e "\n***\n*** Test Failed\n***" - RET=1 - fi - - $PERF_CLIENT -m ${BACKEND}_float32_float32_float32 -p 3000 -t 1 >$CLIENT_LOG 2>&1 - if [ $? 
-ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 - fi - done + load_models + run_unit_tests # Try to load model with name that checks for all types of allowed characters code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${DUMMY_MODEL}/load` @@ -152,6 +187,7 @@ for HOST in "127.0.0.1" "localhost"; do done # Test with Polling +echo "=== Running polling tests ===" SERVER_ARGS="--model-repository=s3://localhost:4572/demo-bucket1.0 --model-control-mode=poll" SERVER_LOG="./inference_server_poll.log" @@ -170,7 +206,7 @@ awslocal $ENDPOINT_FLAG s3 sync models s3://demo-bucket1.0 sleep 20 -set + e +set +e CURL_LOG=$(curl -X POST localhost:8000/v2/repository/index) if [[ "$CURL_LOG" != *"{\"name\":\"libtorch_float32_float32_float32\",\"version\":\"3\",\"state\":\"UNAVAILABLE\",\"reason\":\"unloaded\"}"* ]]; then echo -e "\n***\n*** Failed. Server did not unload libtorch_float32_float32_float32 version 3\n***" @@ -191,9 +227,26 @@ awslocal $ENDPOINT_FLAG s3 rm s3://demo-bucket1.0 --recursive --include "*" && \ awslocal $ENDPOINT_FLAG s3 rb s3://demo-bucket1.0 # Test with Polling, no model configuration file - with strict model config disabled -rm -rf models && mkdir models -cp -r $DATADIR/savedmodel_float32_float32_float32 models/. -rm models/savedmodel_float32_float32_float32/config.pbtxt +echo "=== Running autocomplete tests ===" +AUTOCOMPLETE_BACKENDS="savedmodel" +export BACKENDS=${AUTOCOMPLETE_BACKENDS} + +set +e +setup_model_repo + +TYPES="float32_float32_float32 object_object_object" +for BACKEND in ${AUTOCOMPLETE_BACKENDS}; do + for TYPE in ${TYPES}; do + model="${BACKEND}_${TYPE}" + # Config files specify things expected by unit test like label_filename + # and max_batch_size for comparing results, so remove some key fields + # for autocomplete to fill that won't break the unit test. + sed -i '/platform:/d' models/${model}/config.pbtxt + sed -i '/data_type:/d' models/${model}/config.pbtxt + sed -i '/dims:/d' models/${model}/config.pbtxt + done +done +set -e awslocal $ENDPOINT_FLAG s3 mb s3://demo-bucket1.0 && \ awslocal $ENDPOINT_FLAG s3 sync models s3://demo-bucket1.0 @@ -211,12 +264,7 @@ if [ "$SERVER_PID" == "0" ]; then exit 1 fi -$PERF_CLIENT -m savedmodel_float32_float32_float32 -p 3000 -t 1 > $CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 -fi +run_unit_tests kill $SERVER_PID wait $SERVER_PID @@ -226,23 +274,15 @@ awslocal $ENDPOINT_FLAG s3 rm s3://demo-bucket1.0 --recursive --include "*" && \ awslocal $ENDPOINT_FLAG s3 rb s3://demo-bucket1.0 # Test for multiple model repositories using S3 cloud storage +echo "=== Running multiple-model-repository tests ===" BACKENDS1="graphdef libtorch" BACKENDS2="onnx plan savedmodel" -BACKENDS="$BACKENDS1 $BACKENDS2" - -rm -rf models1 && mkdir models1 -for BACKEND in $BACKENDS1; do - cp -r $DATADIR/${BACKEND}_float32_float32_float32 models1/. - # Remove version policy from config.pbtxt - sed -i '/^version_policy/d' models1/${BACKEND}_float32_float32_float32/config.pbtxt -done +export BACKENDS="$BACKENDS1 $BACKENDS2" -rm -rf models2 && mkdir models2 -for BACKEND in $BACKENDS2; do - cp -r $DATADIR/${BACKEND}_float32_float32_float32 models2/. 
- # Remove version policy from config.pbtxt - sed -i '/^version_policy/d' models2/${BACKEND}_float32_float32_float32/config.pbtxt -done +set +e +setup_model_repo "models1" "${BACKENDS1}" +setup_model_repo "models2" "${BACKENDS2}" +set -e BUCKET_NAME="demo-bucket" MODEL_REPO_ARGS="" @@ -272,25 +312,39 @@ if [ "$SERVER_PID" == "0" ]; then fi set +e -for BACKEND in $BACKENDS; do - code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${BACKEND}_float32_float32_float32/load` - if [ "$code" != "200" ]; then - echo -e "\n***\n*** Test Failed\n***" - RET=1 - fi - - $PERF_CLIENT -m ${BACKEND}_float32_float32_float32 -p 3000 -t 1 >$CLIENT_LOG 2>&1 - if [ $? -ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 - fi -done +load_models "models1" +load_models "models2" +run_unit_tests set -e kill $SERVER_PID wait $SERVER_PID +# Test access decline +AWS_SECRET_ACCESS_KEY_BACKUP=$AWS_SECRET_ACCESS_KEY +export AWS_SECRET_ACCESS_KEY="[Invalid]" +SERVER_ARGS="--model-repository=s3://localhost:4572/${BUCKET_NAME}1 --exit-timeout-secs=120" +SERVER_LOG="./inference_server.access_decline.log" +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Unexpected server start $SERVER\n***" + cat $SERVER_LOG + kill $SERVER_PID + wait $SERVER_PID + RET=1 +else + # MinIO does not appear to reply on access decline, but other implementations + # might provide extra messages, so make sure Triton will print the messages. + EXPECTED_MSG="Unable to create S3 filesystem client. Check account credentials. Exception: '' Message: 'No response body.'" + if ! grep "$EXPECTED_MSG" $SERVER_LOG; then + echo -e "\n***\n*** Expected error message not found\n***" + cat $SERVER_LOG + RET=1 + fi +fi +# Restore keys for destroying buckets +export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY_BACKUP + # Destroy buckets for BUCKET_SUFFIX in 1 2; do awslocal $ENDPOINT_FLAG s3 rm s3://$BUCKET_NAME$BUCKET_SUFFIX --recursive --include "*" && \ @@ -301,10 +355,33 @@ done kill $MINIO_PID wait $MINIO_PID -if [ $RET -eq 0 ]; then - echo -e "\n***\n*** Test Passed\n***" +# Test the S3 client will not advertise HTTP/2 +TEST_LOG="./http2_advertise_test.log" +python3 mock_s3_service.py > $TEST_LOG 2>&1 & +sleep 2 # make sure the mock service has started +SERVER_LOG="./http2_advertise_test.server.log" +SERVER_ARGS="--model-repository=s3://localhost:8080/dummy-bucket --exit-timeout-secs=120" +run_server +if [ "$SERVER_PID" != "0" ]; then + echo -e "\n***\n*** Unexpected server start $SERVER\n***" + cat $SERVER_LOG + kill $SERVER_PID + wait $SERVER_PID + RET=1 else - echo -e "\n***\n*** Test Failed\n***" + sleep 2 # make sure the mock service has stopped + PASSED_MSG="TEST PASSED" + if ! grep "$PASSED_MSG" $TEST_LOG; then + echo -e "\n***\n*** S3 client HTTP/2 advertise test failed\n***" + cat $TEST_LOG + RET=1 + fi fi +# Print and return test result +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test Failed\n***" +fi exit $RET diff --git a/qa/L0_storage_azure/infer_test.py b/qa/L0_storage_azure/infer_test.py deleted file mode 100644 index 372adb2132..0000000000 --- a/qa/L0_storage_azure/infer_test.py +++ /dev/null @@ -1,174 +0,0 @@ -# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions -# are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of NVIDIA CORPORATION nor the names of its -# contributors may be used to endorse or promote products derived -# from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY -# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR -# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR -# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY -# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - -import sys -sys.path.append("../common") - -from builtins import range -from future.utils import iteritems -import unittest -import numpy as np -import infer_util as iu -import test_util as tu -import os - -np_dtype_string = np.dtype(object) - - -class InferTest(tu.TestResultCollector): - - def _full_exact(self, input_dtype, output0_dtype, output1_dtype, - output0_raw, output1_raw, swap): - - def _infer_exact_helper(tester, - pf, - tensor_shape, - batch_size, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=True, - output1_raw=True, - model_version=None, - swap=False, - outputs=("OUTPUT0", "OUTPUT1"), - use_http=True, - use_grpc=True, - skip_request_id_check=False, - use_streaming=True, - correlation_id=0): - for bs in (1, batch_size): - iu.infer_exact(tester, - pf, (bs,) + tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) - - input_size = 16 - - if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - for pf in ["graphdef", "savedmodel"]: - _infer_exact_helper(self, - pf, (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_trt_model(input_dtype, output0_dtype, output1_dtype, - (input_size, 1, 1), (input_size, 1, 1), - (input_size, 1, 1)): - if input_dtype == np.int8: - _infer_exact_helper(self, - 'plan', (input_size, 1, 1), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - else: - _infer_exact_helper(self, - 'plan', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_onnx_model(input_dtype, 
output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): - _infer_exact_helper(self, - 'onnx', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - # Skip for batched string I/O - if tu.validate_for_libtorch_model(input_dtype, output0_dtype, - output1_dtype, (input_size,), - (input_size,), (input_size,), 8): - _infer_exact_helper(self, - 'libtorch', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - def test_raw_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=True) - - def test_raw_ooo(self): - self._full_exact(np_dtype_string, - np_dtype_string, - np_dtype_string, - output0_raw=True, - output1_raw=True, - swap=False) - - def test_class_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=False, - output1_raw=False, - swap=True) - - -if __name__ == '__main__': - unittest.main() diff --git a/qa/L0_storage_azure/test.sh b/qa/L0_storage_azure/test.sh index 0bc44c9a60..15f9c78bcc 100755 --- a/qa/L0_storage_azure/test.sh +++ b/qa/L0_storage_azure/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -55,9 +55,8 @@ ACCOUNT_NAME=$AZURE_STORAGE_ACCOUNT ACCOUNT_KEY=$AZURE_STORAGE_KEY export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG_BASE="./client" -INFER_TEST=infer_test.py +INFER_TEST="../common/infer_test.py" EXPECTED_NUM_TESTS="3" -PERF_CLIENT=../clients/perf_client timestamp=$(date +%s) CONTAINER_NAME="tritonqatest${timestamp}" @@ -82,19 +81,38 @@ source ../common/util.sh rm -f $SERVER_LOG_BASE* $CLIENT_LOG_BASE* RET=0 -BACKENDS="graphdef savedmodel onnx libtorch plan" +# Used to control which backends are run in infer_test.py +BACKENDS=${BACKENDS:="graphdef savedmodel onnx libtorch plan"} -# Construct model repository -mkdir -p models -for FW in $BACKENDS; do - cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/${FW}_float32_float32_float32 models/ -done +function run_unit_tests() { + BACKENDS=$BACKENDS python $INFER_TEST >$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? 
-ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi +} -# Copy models with string inputs and remove nobatch (bs=1) models -cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/*_object_object_object models/ +function setup_model_repo() { + # Construct model repository + rm -rf models && mkdir -p models + for FW in $BACKENDS; do + cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/${FW}_float32_float32_float32 models/ + done -rm -rf models/*nobatch* + # Copy models with string inputs and remove nobatch (bs=1) models + cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/*_object_object_object models/ + rm -rf models/*nobatch* +} +setup_model_repo KIND="KIND_GPU" for FW in $BACKENDS; do for MC in `ls models/${FW}*/config.pbtxt`; do @@ -144,27 +162,52 @@ for ENV_VAR in "shared_key"; do fi set +e - - python $INFER_TEST >$CLIENT_LOG 2>&1 - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Failed\n***" - RET=1 - else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS - if [ $? -ne 0 ]; then - cat $CLIENT_LOG - echo -e "\n***\n*** Test Result Verification Failed\n***" - RET=1 - fi - fi - + run_unit_tests set -e kill $SERVER_PID wait $SERVER_PID done +# Test localization to a specified location +export TRITON_AZURE_MOUNT_DIRECTORY=`pwd`/azure_localization_test + +if [ -d "$TRITON_AZURE_MOUNT_DIRECTORY" ]; then + rm -rf $TRITON_AZURE_MOUNT_DIRECTORY +fi + +mkdir -p $TRITON_AZURE_MOUNT_DIRECTORY + +SERVER_LOG=$SERVER_LOG_BASE.custom_localization.log +SERVER_ARGS="--model-repository=$MODEL_REPO --exit-timeout-secs=120" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +if [ -z "$(ls -A $TRITON_AZURE_MOUNT_DIRECTORY)" ]; then + echo -e "\n***\n*** Test localization to a specified location failed. \n***" + echo -e "\n***\n*** Specified mount folder $TRITON_AZURE_MOUNT_DIRECTORY is empty \n***" + ls -A $TRITON_AZURE_MOUNT_DIRECTORY + exit 1 +fi + +kill $SERVER_PID +wait $SERVER_PID + +if [ -d "$TRITON_AZURE_MOUNT_DIRECTORY" ] && [ ! -z "$(ls -A $TRITON_AZURE_MOUNT_DIRECTORY)" ]; then + echo -e "\n***\n*** Test localization to a specified location failed. \n***" + echo -e "\n***\n*** Specified mount folder $TRITON_AZURE_MOUNT_DIRECTORY was not cleared properly. \n***" + ls -A $TRITON_AZURE_MOUNT_DIRECTORY + exit 1 +fi + +rm -rf $TRITON_AZURE_MOUNT_DIRECTORY +unset TRITON_AZURE_MOUNT_DIRECTORY + # Add test for explicit model control SERVER_LOG=$SERVER_LOG_BASE.explicit.log CLIENT_LOG=$CLIENT_LOG_BASE.explicit.log @@ -179,20 +222,16 @@ if [ "$SERVER_PID" == "0" ]; then fi set +e -for BACKEND in $BACKENDS; do - code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${BACKEND}_float32_float32_float32/load` +for model in `ls models/`; do + code=`curl -s -w %{http_code} -X POST localhost:8000/v2/repository/models/${model}/load` if [ "$code" != "200" ]; then echo -e "\n***\n*** Test Failed\n***" RET=1 fi - - $PERF_CLIENT -m ${BACKEND}_float32_float32_float32 -p 3000 -t 1 >$CLIENT_LOG 2>&1 - if [ $? 
-ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 - fi done + +# Check that each explicitly loaded model runs correctly +run_unit_tests set -e kill $SERVER_PID @@ -211,9 +250,20 @@ SERVER_ARGS="--model-repository=${AS_URL}/models --model-control-mode=poll --str az storage container create --name ${CONTAINER_NAME} --account-name ${ACCOUNT_NAME} --account-key ${ACCOUNT_KEY} sleep 10 +# Setup model repository with minimal configs to be autocompleted rm -rf models && mkdir -p models -cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/savedmodel_float32_float32_float32 models/ -rm models/savedmodel_float32_float32_float32/config.pbtxt +AUTOCOMPLETE_BACKENDS="savedmodel" +for FW in ${AUTOCOMPLETE_BACKENDS}; do + for model in ${FW}_float32_float32_float32 ${FW}_object_object_object; do + cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/${model} models/ + # Config files specify things expected by unit test like label_filename + # and max_batch_size for comparing results, so remove some key fields + # for autocomplete to fill that won't break the unit test. + sed -i '/platform:/d' models/${model}/config.pbtxt + sed -i '/data_type:/d' models/${model}/config.pbtxt + sed -i '/dims:/d' models/${model}/config.pbtxt + done +done # copy contents of models into container. for file in `find models -type f` ;do @@ -229,12 +279,9 @@ if [ "$SERVER_PID" == "0" ]; then fi set +e -$PERF_CLIENT -m savedmodel_float32_float32_float32 -p 3000 -t 1 >$CLIENT_LOG 2>&1 -if [ $? -ne 0 ]; then - echo -e "\n***\n*** Test Failed\n***" - cat $CLIENT_LOG - RET=1 -fi +# Check that each polled model runs correctly +export BACKENDS="${AUTOCOMPLETE_BACKENDS}" +run_unit_tests set -e kill $SERVER_PID diff --git a/qa/L0_storage_swiftstack/infer_test.py b/qa/L0_storage_swiftstack/infer_test.py old mode 100644 new mode 100755 index db2499782c..f8a65a01a4 --- a/qa/L0_storage_swiftstack/infer_test.py +++ b/qa/L0_storage_swiftstack/infer_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,139 +27,181 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
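# The test cases below drive infer_util.infer_exact against the models used by
# the SwiftStack storage test, covering each backend that validates for the
# requested datatype combination (TensorFlow graphdef/savedmodel, TensorRT
# plan, ONNX and libtorch) over HTTP, gRPC and streaming, with both raw and
# classification outputs.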
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu -import os class InferTest(tu.TestResultCollector): - - def _full_exact(self, input_dtype, output0_dtype, output1_dtype, - output0_raw, output1_raw, swap): - - def _infer_exact_helper(tester, - pf, - tensor_shape, - batch_size, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=True, - output1_raw=True, - model_version=None, - swap=False, - outputs=("OUTPUT0", "OUTPUT1"), - use_http=True, - use_grpc=True, - skip_request_id_check=False, - use_streaming=True, - correlation_id=0): + def _full_exact( + self, input_dtype, output0_dtype, output1_dtype, output0_raw, output1_raw, swap + ): + def _infer_exact_helper( + tester, + pf, + tensor_shape, + batch_size, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=True, + output1_raw=True, + model_version=None, + swap=False, + outputs=("OUTPUT0", "OUTPUT1"), + use_http=True, + use_grpc=True, + skip_request_id_check=False, + use_streaming=True, + correlation_id=0, + ): for bs in (1, batch_size): - iu.infer_exact(tester, - pf, (bs,) + tensor_shape, - bs, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - model_version=model_version, - swap=swap, - outputs=outputs, - use_http=use_http, - use_grpc=use_grpc, - skip_request_id_check=skip_request_id_check, - use_streaming=use_streaming, - correlation_id=correlation_id) + iu.infer_exact( + tester, + pf, + (bs,) + tensor_shape, + bs, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + model_version=model_version, + swap=swap, + outputs=outputs, + use_http=use_http, + use_grpc=use_grpc, + skip_request_id_check=skip_request_id_check, + use_streaming=use_streaming, + correlation_id=correlation_id, + ) input_size = 16 - if tu.validate_for_tf_model(input_dtype, output0_dtype, output1_dtype, - (input_size,), (input_size,), - (input_size,)): + if tu.validate_for_tf_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size,), + (input_size,), + (input_size,), + ): for pf in ["graphdef", "savedmodel"]: - _infer_exact_helper(self, - pf, (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_trt_model(input_dtype, output0_dtype, output1_dtype, - (input_size, 1, 1), (input_size, 1, 1), - (input_size, 1, 1)): + _infer_exact_helper( + self, + pf, + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) + + if tu.validate_for_trt_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size, 1, 1), + (input_size, 1, 1), + (input_size, 1, 1), + ): if input_dtype == np.int8: - _infer_exact_helper(self, - 'plan', (input_size, 1, 1), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) + _infer_exact_helper( + self, + "plan", + (input_size, 1, 1), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) else: - _infer_exact_helper(self, - 'plan', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_onnx_model(input_dtype, output0_dtype, 
output1_dtype, - (input_size,), (input_size,), - (input_size,)): - _infer_exact_helper(self, - 'onnx', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) - - if tu.validate_for_libtorch_model(input_dtype, output0_dtype, - output1_dtype, (input_size,), - (input_size,), (input_size,)): - _infer_exact_helper(self, - 'libtorch', (input_size,), - 8, - input_dtype, - output0_dtype, - output1_dtype, - output0_raw=output0_raw, - output1_raw=output1_raw, - swap=swap) + _infer_exact_helper( + self, + "plan", + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) + + if tu.validate_for_onnx_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size,), + (input_size,), + (input_size,), + ): + _infer_exact_helper( + self, + "onnx", + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) + + if tu.validate_for_libtorch_model( + input_dtype, + output0_dtype, + output1_dtype, + (input_size,), + (input_size,), + (input_size,), + ): + _infer_exact_helper( + self, + "libtorch", + (input_size,), + 8, + input_dtype, + output0_dtype, + output1_dtype, + output0_raw=output0_raw, + output1_raw=output1_raw, + swap=swap, + ) def test_raw_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=True, - output1_raw=True, - swap=True) + self._full_exact( + np.float32, + np.float32, + np.float32, + output0_raw=True, + output1_raw=True, + swap=True, + ) def test_class_fff(self): - self._full_exact(np.float32, - np.float32, - np.float32, - output0_raw=False, - output1_raw=False, - swap=True) + self._full_exact( + np.float32, + np.float32, + np.float32, + output0_raw=False, + output1_raw=False, + swap=True, + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_string_io/string_client_test.py b/qa/L0_string_io/string_client_test.py old mode 100644 new mode 100755 index b012ce87af..16112ac70c --- a/qa/L0_string_io/string_client_test.py +++ b/qa/L0_string_io/string_client_test.py @@ -1,5 +1,5 @@ #!/usr/bin/env python -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,27 +26,26 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys -sys.path.append('../common') -import argparse -import numpy as np -import os +sys.path.append("../common") + +import unittest from builtins import range -import tritonclient.http as tritonhttpclient + +import numpy as np +import test_util as tu import tritonclient.grpc as tritongrpcclient +import tritonclient.http as tritonhttpclient import tritonclient.utils as tritonutils -import unittest -import test_util as tu class ClientStringTest(tu.TestResultCollector): - def _test_infer_unicode(self, model_name, client, input_): # Send inference request to the inference server. Get results for # both output tensors. 
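        # Each "client" argument is a tuple assembled in _test_bytes():
        #   client[0] - the InferenceServerClient instance used to run the request
        #   client[1] - the client module (tritonhttpclient or tritongrpcclient)
        #   client[2] - for HTTP, the binary_data flag for the requested output
        #   client[3] - for HTTP, the binary_data flag used when setting the input
        # gRPC tuples carry only the first three entries because gRPC requests
        # always send tensor data in binary form.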
inputs = [] outputs = [] - inputs.append(client[1].InferInput('INPUT0', input_.shape, "BYTES")) + inputs.append(client[1].InferInput("INPUT0", input_.shape, "BYTES")) if client[1] == tritonhttpclient: inputs[0].set_data_from_numpy(input_, client[3]) @@ -54,31 +53,26 @@ def _test_infer_unicode(self, model_name, client, input_): inputs[0].set_data_from_numpy(input_) if client[1] == tritonhttpclient: - outputs.append(client[1].InferRequestedOutput( - 'OUTPUT0', binary_data=client[2])) + outputs.append( + client[1].InferRequestedOutput("OUTPUT0", binary_data=client[2]) + ) else: - outputs.append(client[1].InferRequestedOutput('OUTPUT0')) + outputs.append(client[1].InferRequestedOutput("OUTPUT0")) - results = client[0].infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = client[0].infer(model_name=model_name, inputs=inputs, outputs=outputs) - out0 = results.as_numpy('OUTPUT0') + out0 = results.as_numpy("OUTPUT0") # We expect there to be 1 results (with batch-size 1). Verify # that all 8 result elements are the same as the input. self.assertTrue(np.array_equal(input_, out0)) return out0 - def _test_infer_non_unicode(self, - model_name, - client, - input_, - binary_data=True): + def _test_infer_non_unicode(self, model_name, client, input_, binary_data=True): # Send inference request to the inference server. Get results for # both output tensors. inputs = [] outputs = [] - inputs.append(client[1].InferInput('INPUT0', input_.shape, "BYTES")) + inputs.append(client[1].InferInput("INPUT0", input_.shape, "BYTES")) if client[1] == tritonhttpclient: inputs[0].set_data_from_numpy(input_, client[3]) @@ -86,57 +80,58 @@ def _test_infer_non_unicode(self, inputs[0].set_data_from_numpy(input_) if client[1] == tritonhttpclient: - outputs.append(client[1].InferRequestedOutput( - 'OUTPUT0', binary_data=client[2])) + outputs.append( + client[1].InferRequestedOutput("OUTPUT0", binary_data=client[2]) + ) else: - outputs.append(client[1].InferRequestedOutput('OUTPUT0')) + outputs.append(client[1].InferRequestedOutput("OUTPUT0")) - results = client[0].infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = client[0].infer(model_name=model_name, inputs=inputs, outputs=outputs) - out0 = results.as_numpy('OUTPUT0') + out0 = results.as_numpy("OUTPUT0") # We expect there to be 1 results (with batch-size 1). Verify # that all 8 result elements are the same as the input. if client[2]: self.assertTrue(np.array_equal(input_.astype(np.bytes_), out0)) else: self.assertTrue( - np.array_equal(input_.astype(np.bytes_), - out0.astype(np.bytes_))) + np.array_equal(input_.astype(np.bytes_), out0.astype(np.bytes_)) + ) return out0 - def _test_unicode_bytes_dtype(self, client, model_name, dtype='|S78'): + def _test_unicode_bytes_dtype(self, client, model_name, dtype="|S78"): # Create the data for the input tensor. Initialize the tensor to 8 # byte strings. 
(dtype of np.bytes_) # Sample string that should no longer cause failure - in0 = np.array([ - [ - b'\nF\n\'\n\x01a\x12"\x1a \n\x1e\xfa\x03\x94\x01\x0f\xd7\x02\xf1\x05\xdf\x01\x82\x03\xb5\x05\xc1\x07\xba\x06\xff\x06\xc7\x07L\xf5\x03\xe2\x07\xa9\x03\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04\xdf\\\xcb\xbf' - ], - [ - b'\n:\n\x1a\n\x01a\x12\x15\x1a\x13\n\x11*\xe3\x05\xc5\x06\xda\x07\xcb\x06~\xb1\x05\xb3\x01\xa9\x02\x15\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\xbb[\n\xbf' - ], - [ - b'\nL\n-\n\x01a\x12(\x1a&\n$\x87\x07\xce\x01\xe7\x06\xee\x04\xe1\x03\xf1\x03\xd7\x07\xbe\x02\xb8\x05\xe0\x05\xe4\x01\x88\x06\xb6\x03\xb9\x05\x83\x06\xf8\x04\xe2\x04\xf4\x06\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04\xbc\x99+@' - ], - [ - b'\n2\n\x12\n\x01a\x12\r\x1a\x0b\n\t\x99\x02\xde\x04\x9f\x04\xc5\x053\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\x12\x07\x83\xbe' - ], - [ - b'\nJ\n\r\n\x01b\x12\x08\x1a\x06\n\x04\x9b\x94\xad\x04\n\r\n\x01c\x12\x08\x12\x06\n\x04\xc3\x8a\x08\xbf\n*\n\x01a\x12%\x1a#\n!\x9c\x02\xb2\x02\xcd\x02\x9d\x07\x8d\x01\xb6\x05a\xf1\x01\xf0\x05\xdb\x02\xac\x04\xbd\x05\xe0\x04\xd2\x06\xaf\x02\xa8\x01\x8b\x04' - ], + in0 = np.array( [ - b'\n3\n\x13\n\x01a\x12\x0e\x1a\x0c\n\n<\xe2\x05\x8a\x01\xb3\x07?\xfd\x01\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\x1b\x931\xbf\x00\x00' + [ + b"\nF\n'\n\x01a\x12\"\x1a \n\x1e\xfa\x03\x94\x01\x0f\xd7\x02\xf1\x05\xdf\x01\x82\x03\xb5\x05\xc1\x07\xba\x06\xff\x06\xc7\x07L\xf5\x03\xe2\x07\xa9\x03\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04\xdf\\\xcb\xbf" + ], + [ + b"\n:\n\x1a\n\x01a\x12\x15\x1a\x13\n\x11*\xe3\x05\xc5\x06\xda\x07\xcb\x06~\xb1\x05\xb3\x01\xa9\x02\x15\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\xbb[\n\xbf" + ], + [ + b"\nL\n-\n\x01a\x12(\x1a&\n$\x87\x07\xce\x01\xe7\x06\xee\x04\xe1\x03\xf1\x03\xd7\x07\xbe\x02\xb8\x05\xe0\x05\xe4\x01\x88\x06\xb6\x03\xb9\x05\x83\x06\xf8\x04\xe2\x04\xf4\x06\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04\xbc\x99+@" + ], + [ + b"\n2\n\x12\n\x01a\x12\r\x1a\x0b\n\t\x99\x02\xde\x04\x9f\x04\xc5\x053\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\x12\x07\x83\xbe" + ], + [ + b"\nJ\n\r\n\x01b\x12\x08\x1a\x06\n\x04\x9b\x94\xad\x04\n\r\n\x01c\x12\x08\x12\x06\n\x04\xc3\x8a\x08\xbf\n*\n\x01a\x12%\x1a#\n!\x9c\x02\xb2\x02\xcd\x02\x9d\x07\x8d\x01\xb6\x05a\xf1\x01\xf0\x05\xdb\x02\xac\x04\xbd\x05\xe0\x04\xd2\x06\xaf\x02\xa8\x01\x8b\x04" + ], + [ + b"\n3\n\x13\n\x01a\x12\x0e\x1a\x0c\n\n<\xe2\x05\x8a\x01\xb3\x07?\xfd\x01\n\r\n\x01b\x12\x08\x1a\x06\n\x04\xf6\xa2\xc5\x01\n\r\n\x01c\x12\x08\x12\x06\n\x04\x1b\x931\xbf\x00\x00" + ], + [ + b"\n&\n\x07\n\x01a\x12\x02\x1a\x00\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04{\xbc\x0e>\x00\x00\x00" + ], + [ + b"\nF\n'\n\x01a\x12\"\x1a \n\x1e\x97\x01\x93\x02\x9e\x01\xac\x06\xff\x01\xd8\x05\xe1\x07\xd8\x04g]\x9a\x05\xff\x06\xde\x07\x8f\x04\x97\x04\xda\x03\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x9a\xb7I\n\r\n\x01c\x12\x08\x12\x06\n\x04\xfb\x87\x83\xbf" + ], ], - [ - b'\n&\n\x07\n\x01a\x12\x02\x1a\x00\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x89\xcc=\n\r\n\x01c\x12\x08\x12\x06\n\x04{\xbc\x0e>\x00\x00\x00' - ], - [ - b'\nF\n\'\n\x01a\x12"\x1a 
\n\x1e\x97\x01\x93\x02\x9e\x01\xac\x06\xff\x01\xd8\x05\xe1\x07\xd8\x04g]\x9a\x05\xff\x06\xde\x07\x8f\x04\x97\x04\xda\x03\n\x0c\n\x01b\x12\x07\x1a\x05\n\x03\x9a\xb7I\n\r\n\x01c\x12\x08\x12\x06\n\x04\xfb\x87\x83\xbf' - ] - ], - dtype=dtype).flatten() + dtype=dtype, + ).flatten() self._test_infer_unicode(model_name, client, in0) def _test_str_dtype(self, client, model_name, dtype=np.object_): @@ -147,30 +142,44 @@ def _test_str_dtype(self, client, model_name, dtype=np.object_): self._test_infer_non_unicode(model_name, client, in0_bytes) def _test_bytes(self, model_name): - dtypes = [np.object_, np.object, np.bytes_] + dtypes = [np.object_, np.bytes_] # This clients will fail for binary_data=False when the binary input # is not UTF-8 encodable. They should work for other cases however. binary_false_clients = [ - (tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True), - tritonhttpclient, True, False), - (tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True), - tritonhttpclient, False, False), - (tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True), - tritonhttpclient, False, True), + ( + tritonhttpclient.InferenceServerClient("localhost:8000", verbose=True), + tritonhttpclient, + True, + False, + ), + ( + tritonhttpclient.InferenceServerClient("localhost:8000", verbose=True), + tritonhttpclient, + False, + False, + ), + ( + tritonhttpclient.InferenceServerClient("localhost:8000", verbose=True), + tritonhttpclient, + False, + True, + ), ] # These clients work for every data type other_clients = [ - (tritongrpcclient.InferenceServerClient("localhost:8001", - verbose=True), - tritongrpcclient, False), - (tritonhttpclient.InferenceServerClient("localhost:8000", - verbose=True), - tritonhttpclient, True, True), + ( + tritongrpcclient.InferenceServerClient("localhost:8001", verbose=True), + tritongrpcclient, + False, + ), + ( + tritonhttpclient.InferenceServerClient("localhost:8000", verbose=True), + tritonhttpclient, + True, + True, + ), ] for client in other_clients + binary_false_clients: @@ -195,5 +204,5 @@ def test_tf_unicode_bytes(self): self._test_bytes("string_identity") -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_tf_gpu_io/test.sh b/qa/L0_tf_gpu_io/test.sh index 2b520d219c..98a5dff1ef 100755 --- a/qa/L0_tf_gpu_io/test.sh +++ b/qa/L0_tf_gpu_io/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2019-2023, NVIDIA CORPORATION. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -40,9 +40,8 @@ fi export CUDA_VISIBLE_DEVICES=0 -CLIENT=../clients/perf_client +TF_TEST=tf_gpu_io_test.py BACKENDS=${BACKENDS:="graphdef savedmodel"} -TENSOR_SIZE=16384 DATADIR=/data/inferenceserver/${REPO_VERSION} @@ -50,11 +49,9 @@ SERVER=/opt/tritonserver/bin/tritonserver source ../common/util.sh RET=0 - -# -# Use "identity" model for all model types. 
-# rm -f ./*.log + +# Test with qa identity TF models for BACKEND in $BACKENDS; do MODEL_NAME=${BACKEND}_zero_1_float32 rm -fr models && mkdir -p models @@ -70,7 +67,7 @@ for BACKEND in $BACKENDS; do echo "optimization { execution_accelerators { gpu_execution_accelerator : [ { name : \"gpu_io\"} ] } }" >> config.pbtxt) SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1" - SERVER_LOG="${MODEL_NAME}.serverlog" + SERVER_LOG="${MODEL_NAME}.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -80,60 +77,71 @@ for BACKEND in $BACKENDS; do set +e - $CLIENT -m${MODEL_NAME}_def --shape INPUT0:${TENSOR_SIZE} \ - >> ${BACKEND}.sanity.log 2>&1 + python $TF_TEST TfGpuIoTest.test_${MODEL_NAME}_def >> ${BACKEND}.sanity.log 2>&1 if (( $? != 0 )); then + cat ${BACKEND}.sanity.log RET=1 fi - grep "is GPU tensor: true" $SERVER_LOG + grep "is GPU tensor: true" $SERVER_LOG >> grep.out.log if [ $? -eq 0 ]; then echo -e "\n***\n*** Failed. Expected neither input or output is GPU tensor\n***" RET=1 fi - $CLIENT -m${MODEL_NAME}_gpu --shape INPUT0:${TENSOR_SIZE} \ - >> ${BACKEND}.gpu.sanity.log 2>&1 + python $TF_TEST TfGpuIoTest.test_${MODEL_NAME}_gpu >> ${BACKEND}.gpu.sanity.log 2>&1 if (( $? != 0 )); then + cat ${BACKEND}.gpu.sanity.log RET=1 fi - grep "is GPU tensor: true" $SERVER_LOG + grep "is GPU tensor: true" $SERVER_LOG >> grep.out.log if [ $? -ne 0 ]; then echo -e "\n***\n*** Failed. Expected input and output are GPU tensors\n***" RET=1 fi - # Sample latency results - $CLIENT -m${MODEL_NAME}_def --shape INPUT0:${TENSOR_SIZE} \ - >> ${BACKEND}.log 2>&1 - if (( $? != 0 )); then - RET=1 - fi - - $CLIENT -m${MODEL_NAME}_gpu --shape INPUT0:${TENSOR_SIZE} \ - >> ${BACKEND}.gpu.log 2>&1 - if (( $? != 0 )); then - RET=1 - fi - set -e kill $SERVER_PID wait $SERVER_PID done -for BACKEND in $BACKENDS; do - echo -e "\n${BACKEND}\n************" - cat ${BACKEND}.log - echo -e "\n${BACKEND} with GPU I/O\n************" - cat ${BACKEND}.gpu.log -done +# Test savedmodel with mismatched key and name +rm -rf models && mkdir -p models +cp -r $DATADIR/qa_tf_tag_sigdef_repository/sig_tag0 models +(cd models/sig_tag0 && \ + echo "optimization { execution_accelerators { gpu_execution_accelerator : [ { name : \"gpu_io\"} ] } }" >> config.pbtxt) + +SERVER_ARGS="--model-repository=`pwd`/models --log-verbose=1" +SERVER_LOG="sig_tag0.server.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +CLIENT_LOG="sig_tag0.gpu.log" +python $TF_TEST TfGpuIoTest.test_sig_tag0 >> $CLIENT_LOG 2>&1 +if (( $? != 0 )); then + cat $CLIENT_LOG + RET=1 +fi +grep "is GPU tensor: true" $SERVER_LOG >> grep.out.log +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Failed. Expected input and output are GPU tensors\n***" + RET=1 +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else echo -e "\n***\n*** Test FAILED\n***" fi - exit $RET diff --git a/qa/L0_tf_gpu_io/tf_gpu_io_test.py b/qa/L0_tf_gpu_io/tf_gpu_io_test.py new file mode 100755 index 0000000000..fd3550e434 --- /dev/null +++ b/qa/L0_tf_gpu_io/tf_gpu_io_test.py @@ -0,0 +1,105 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import unittest + +import infer_util as iu +import numpy as np +import test_util as tu + +TENSOR_SIZE = 16384 + + +class TfGpuIoTest(tu.TestResultCollector): + def _test_helper( + self, + model_name, + shape, + override_input_names=[], + override_output_names=[], + batching_enabled=False, + ): + try: + bs = 1 + if batching_enabled: + shape = [ + [ + bs, + ] + + shape + ] + iu.infer_zero( + self, + "graphdef", + bs, + np.float32, + shape, + shape, + override_model_name=model_name, + override_input_names=override_input_names, + override_output_names=override_output_names, + ) + + except Exception as ex: + self.assertTrue(False, "unexpected error {}".format(ex)) + + def test_sig_tag0(self): + self._test_helper( + "sig_tag0", + [16], + override_input_names=["INPUT"], + override_output_names=["OUTPUT"], + ) + + def test_graphdef_zero_1_float32_def(self): + self._test_helper( + "graphdef_zero_1_float32_def", [TENSOR_SIZE], batching_enabled=True + ) + + def test_graphdef_zero_1_float32_gpu(self): + self._test_helper( + "graphdef_zero_1_float32_gpu", [TENSOR_SIZE], batching_enabled=True + ) + + def test_savedmodel_zero_1_float32_def(self): + self._test_helper( + "savedmodel_zero_1_float32_def", [TENSOR_SIZE], batching_enabled=True + ) + + def test_savedmodel_zero_1_float32_gpu(self): + self._test_helper( + "savedmodel_zero_1_float32_gpu", [TENSOR_SIZE], batching_enabled=True + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_tf_parameters/test.sh b/qa/L0_tf_parameters/test.sh new file mode 100755 index 0000000000..133b6ef68d --- /dev/null +++ b/qa/L0_tf_parameters/test.sh @@ -0,0 +1,150 @@ +#!/bin/bash +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! -z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi +source ../common/util.sh + +export CUDA_VISIBLE_DEVICES=0 + +DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_tf_parameters_repository +TEST_RESULT_FILE='test_results.txt' +CLIENT_LOG="./client.log" +TEST=tf_parameter_test.py +EXPECTED_NUM_TESTS="1" +MODEL_REPOSITORY=`pwd`/models +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_LOG="./inference_server.log" + +RET=0 + +rm -rf $SERVER_LOG $CLIENT_LOG models/ +cp -r $DATADIR models +SERVER_ARGS="--model-repository=$MODEL_REPOSITORY" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python $TEST TFParameterTest.test_tf_variable_error>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Add the initialization operation +echo "{\"init_ops\": [\"init\"]}" > models/graphdef_variable/init_ops.json +echo "parameters: { key: \"TF_INIT_OPS_FILE\" value: { string_value:\"init_ops.json\" }}" >> models/graphdef_variable/config.pbtxt + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python $TEST TFParameterTest.test_tf_variable>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +# Move the initialization op to the model version folder. 
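# (For reference: init_ops.json names the TF operations used to initialize the
# model's variables; until it is wired up via TF_INIT_OPS_FILE the run above
# expects the uninitialized-variable FAILED_PRECONDITION error, and after it is
# added test_tf_variable expects OUTPUT == INPUT. The block below repeats the
# passing case with init_ops.json placed in the model version directory.)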
+mv models/graphdef_variable/init_ops.json models/graphdef_variable/1/ + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python $TEST TFParameterTest.test_tf_variable>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + cat $CLIENT_LOG + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_tf_parameters/tf_parameter_test.py b/qa/L0_tf_parameters/tf_parameter_test.py new file mode 100755 index 0000000000..f1a4621d93 --- /dev/null +++ b/qa/L0_tf_parameters/tf_parameter_test.py @@ -0,0 +1,81 @@ +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import unittest + +import numpy as np +import test_util as tu +import tritonclient.http as tritonhttpclient +import tritonclient.utils + + +class TFParameterTest(tu.TestResultCollector): + def setUp(self): + self._client = tritonhttpclient.InferenceServerClient( + "localhost:8000", verbose=True + ) + + def _infer_helper(self): + # The model has a single variable which is added to the input. Since the + # variable is initialized to zero the input and output must match. 
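# Concretely, for the request below: INPUT = [10], the variable is 0 after the
# init ops run, so OUTPUT = INPUT + 0 = [10], which is exactly what the
# np.testing.assert_array_equal check verifies.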
+ model_name = "graphdef_variable" + input = np.array([10], dtype=np.int32) + + inputs = [] + inputs.append(tritonhttpclient.InferInput("INPUT", input.shape, "INT32")) + inputs[-1].set_data_from_numpy(input) + + outputs = [] + outputs.append(tritonhttpclient.InferRequestedOutput("OUTPUT")) + + results = self._client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) + output = results.as_numpy("OUTPUT") + np.testing.assert_array_equal(output, input) + + def test_tf_variable(self): + self._infer_helper() + + def test_tf_variable_error(self): + with self.assertRaises(tritonclient.utils.InferenceServerException) as e: + self._infer_helper() + self.assertIn( + "FAILED_PRECONDITION: Could not find variable VARIABLE. This " + + "could mean that the variable has been deleted. In TF1, it can " + + "also mean the variable is uninitialized.", + e.exception.message(), + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_tf_tag_sigdef/test.sh b/qa/L0_tf_tag_sigdef/test.sh index 8a0295d810..32248c74ad 100755 --- a/qa/L0_tf_tag_sigdef/test.sh +++ b/qa/L0_tf_tag_sigdef/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -43,22 +43,29 @@ export CUDA_VISIBLE_DEVICES=0 TEST_RESULT_FILE='test_results.txt' CLIENT_LOG="./client.log" TEST=tf_tag_sigdef_test.py -MAKE_MODEL=gen_tag_sigdef.py DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_tf_tag_sigdef_repository +MODELDIR=`pwd`/models + +rm -rf $SERVER_LOG $CLIENT_LOG $MODELDIR +mkdir $MODELDIR +cp -r $DATADIR/* $MODELDIR + EXPECTED_NUM_TESTS="4" SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=$DATADIR --exit-timeout-secs=120" +SERVER_ARGS="--model-repository=$MODELDIR --exit-timeout-secs=120" SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -f $SERVER_LOG $CLIENT_LOG - RET=0 run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" + if [ `grep -c "configuration expects 2 inputs, model provides 1" $SERVER_LOG` != "0" ]; then + echo -e "*** FAILED: sig_tag_different_io config autocompleted with wrong model tag variant, failed to load.\n" + RET=1 + fi cat $SERVER_LOG exit 1 fi diff --git a/qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py b/qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py old mode 100644 new mode 100755 index d3739afea4..b4a11ac04e --- a/qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py +++ b/qa/L0_tf_tag_sigdef/tf_tag_sigdef_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,14 +30,11 @@ sys.path.append("../common") -from builtins import range -from future.utils import iteritems import unittest + import numpy as np -import os import test_util as tu import tritonhttpclient as httpclient -from tritonclientutils import InferenceServerException class TagSigdefTest(tu.TestResultCollector): @@ -53,16 +52,14 @@ def _test_helper(self, modelVersion, tag, sig_def): # for details multiplier = modelVersion + 1 output_name = "OUTPUT" - triton_client = httpclient.InferenceServerClient("localhost:8000", - verbose=True) + triton_client = httpclient.InferenceServerClient("localhost:8000", verbose=True) inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT', shape, "FP32")) + inputs.append(httpclient.InferInput("INPUT", shape, "FP32")) input_data = np.ones(shape=shape).astype(np.float32) inputs[0].set_data_from_numpy(input_data, binary_data=True) - outputs.append( - httpclient.InferRequestedOutput(output_name, binary_data=True)) + outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True)) results = triton_client.infer(model_name, inputs, outputs=outputs) output_data = results.as_numpy(output_name) test_output = input_data * multiplier @@ -81,5 +78,5 @@ def test_tag_sig_def(self): self._test_helper(3, self.test_tag, self.test_sig_def) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_tf_unknown_rank/test.sh b/qa/L0_tf_unknown_rank/test.sh old mode 100644 new mode 100755 index ab9db57f24..e279a46267 --- a/qa/L0_tf_unknown_rank/test.sh +++ b/qa/L0_tf_unknown_rank/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -79,7 +79,7 @@ else fi fi -python $UNKNOWN_RANK_TEST UnknownRankTest.test_wrong_output >> $CLIENT_LOG 2>&1 +python $UNKNOWN_RANK_TEST UnknownRankTest.test_wrong_input >> $CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then echo -e "\n***\n*** Test Failed\n***" cat $CLIENT_LOG @@ -109,9 +109,10 @@ if [ "$SERVER_PID" != "0" ]; then kill $SERVER_PID wait $SERVER_PID else - ERROR_MESSAGE="unable to autofill for 'scalar_model': the rank of model tensor 'x' is 0 which is not supported" + ERROR_MESSAGE="Unable to autofill for 'scalar_model': the rank of model tensor 'x' is 0 and dimensions are not defined" if [[ $(cat $SERVER_LOG | grep "${ERROR_MESSAGE}" | wc -l) -ne 2 ]]; then echo -e "\n***\n*** Test Failed: "${ERROR_MESSAGE}" not found\n***" + cat $SERVER_LOG RET=1 fi fi diff --git a/qa/L0_tf_unknown_rank/tf_unknown_rank_test.py b/qa/L0_tf_unknown_rank/tf_unknown_rank_test.py old mode 100644 new mode 100755 index 427220a782..add6b32c13 --- a/qa/L0_tf_unknown_rank/tf_unknown_rank_test.py +++ b/qa/L0_tf_unknown_rank/tf_unknown_rank_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,9 +27,11 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") import unittest + import numpy as np import test_util as tu import tritonhttpclient @@ -39,33 +43,40 @@ class UnknownRankTest(tu.TestResultCollector): def infer_unknown(self, model_name, tensor_shape): print("About to run the test") input_data = np.random.random_sample(tensor_shape).astype(np.float32) - client = tritonhttpclient.InferenceServerClient('localhost:8000') + client = tritonhttpclient.InferenceServerClient("localhost:8000") inputs = [ - tritonhttpclient.InferInput("INPUT", input_data.shape, - np_to_triton_dtype(input_data.dtype)) + tritonhttpclient.InferInput( + "INPUT", input_data.shape, np_to_triton_dtype(input_data.dtype) + ) ] inputs[0].set_data_from_numpy(input_data) results = client.infer(model_name, inputs) - self.assertTrue(np.array_equal(results.as_numpy('OUTPUT'), input_data)) + self.assertTrue(np.array_equal(results.as_numpy("OUTPUT"), input_data)) def test_success(self): model_name = "unknown_rank_success" - tensor_shape = (1,) + tensor_shape = 1 try: self.infer_unknown(model_name, tensor_shape) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) - def test_wrong_output(self): - tensor_shape = (1,) + def test_wrong_input(self): model_name = "unknown_rank_wrong_output" + tensor_shape = (1, 2) try: self.infer_unknown(model_name, tensor_shape) + self.fail( + "Found success when expected failure with model given " + "wrong input tensor [1,2] for input [-1,1]." + ) except InferenceServerException as ex: - self.assertIn("tensor \'OUTPUT\': the model expects 1 dimensions " \ - "(shape [1]) but the model configuration specifies 2 dimensions " \ - "(shape [1,1])", ex.message()) + self.assertIn( + "unexpected shape for input 'INPUT' for model " + "'unknown_rank_wrong_output'. Expected [1], got [1,2]", + ex.message(), + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_tftrt_optimization/tftrt_optimization_test.py b/qa/L0_tftrt_optimization/tftrt_optimization_test.py old mode 100644 new mode 100755 index b25734f606..9e59677317 --- a/qa/L0_tftrt_optimization/tftrt_optimization_test.py +++ b/qa/L0_tftrt_optimization/tftrt_optimization_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,51 +27,49 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") import unittest + import numpy as np import test_util as tu import tritonhttpclient as httpclient -from tritonclientutils import InferenceServerException class TFTRTOptimizationTest(tu.TestResultCollector): - def setUp(self): - self.input0_ = np.arange(start=0, stop=16, - dtype=np.float32).reshape(1, 16) + self.input0_ = np.arange(start=0, stop=16, dtype=np.float32).reshape(1, 16) self.input1_ = np.ones(shape=16, dtype=np.float32).reshape(1, 16) self.expected_output0_ = self.input0_ + self.input1_ self.expected_output1_ = self.input0_ - self.input1_ def _addsub_infer(self, model_name): - triton_client = httpclient.InferenceServerClient("localhost:8000", - verbose=True) + triton_client = httpclient.InferenceServerClient("localhost:8000", verbose=True) inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "FP32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "FP32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "FP32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "FP32")) # Initialize the data inputs[0].set_data_from_numpy(self.input0_, binary_data=True) inputs[1].set_data_from_numpy(self.input1_, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=True)) results = triton_client.infer(model_name, inputs, outputs=outputs) - output0_data = results.as_numpy('OUTPUT0') - output1_data = results.as_numpy('OUTPUT1') + output0_data = results.as_numpy("OUTPUT0") + output1_data = results.as_numpy("OUTPUT1") - self.assertTrue(np.array_equal(self.expected_output0_, output0_data), - "incorrect sum") - self.assertTrue(np.array_equal(self.expected_output1_, output1_data), - "incorrect difference") + self.assertTrue( + np.array_equal(self.expected_output0_, output0_data), "incorrect sum" + ) + self.assertTrue( + np.array_equal(self.expected_output1_, output1_data), "incorrect difference" + ) def test_graphdef(self): self._addsub_infer("graphdef_float32_float32_float32_trt") @@ -80,5 +80,5 @@ def test_savedmodel(self): self._addsub_infer("savedmodel_float32_float32_float32_param") -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_trace/opentelemetry_unittest.py b/qa/L0_trace/opentelemetry_unittest.py new file mode 100644 index 0000000000..5055f4e88a --- /dev/null +++ b/qa/L0_trace/opentelemetry_unittest.py @@ -0,0 +1,274 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") +import json +import re +import unittest + +import numpy as np +import test_util as tu +import tritonclient.grpc as grpcclient +import tritonclient.http as httpclient + +EXPECTED_NUM_SPANS = 16 +# OpenTelemetry OStream exporter sets `parent_span_id` to "0000000000000000", +# if current span is a root span, i.e. there is no parent span. +# https://github.com/open-telemetry/opentelemetry-cpp/blob/b7fd057185c4ed2dff507b859cbe058b7609fb4a/exporters/ostream/src/span_exporter.cc#L78C54-L78C68 +NO_PARENT_SPAN = "0000000000000000" + + +class OpenTelemetryTest(tu.TestResultCollector): + def setUp(self): + # Extracted spans are in json-like format, thus data needs to be + # post-processed, so that `json` could accept it for further + # processing + with open("trace_collector.log", "rt") as f: + data = f.read() + # Removing new lines and tabs around `{` + json_string = re.sub("\n\t{\n\t", "{", data) + # `resources` field is a dictionary, so adding `{` and`}` + # in the next 2 transformations, `instr-lib` is a next field, + # so whatever goes before it, belongs to `resources`. + json_string = re.sub( + "resources : \n\t", "resources : {\n\t", json_string + ) + json_string = re.sub( + "\n instr-lib :", "}\n instr-lib :", json_string + ) + # `json`` expects "key":"value" format, some fields in the + # data have empty string as value, so need to add `"",` + json_string = re.sub(": \n\t", ':"",', json_string) + json_string = re.sub(": \n", ':"",', json_string) + # Extracted data missing `,' after each key-value pair, + # which `json` exppects + json_string = re.sub("\n|\n\t", ",", json_string) + # Removing tabs + json_string = re.sub("\t", "", json_string) + # `json` expects each key and value have `"`'s, so adding them to + # every word/number/alpha-numeric entry + json_string = re.sub(r"\b([\w.-]+)\b", r'"\1"', json_string) + # `span kind`` represents one key + json_string = re.sub('"span" "kind"', '"span kind"', json_string) + # Removing extra `,` + json_string = re.sub("{,", "{", json_string) + json_string = re.sub(",}", "}", json_string) + # Adding `,` between dictionary entries + json_string = re.sub("}{", "},{", json_string) + # `events` is a list of dictionaries, `json` will accept it in the + # form of "events" : [{....}, {.....}, ...] + json_string = re.sub( + '"events" : {', '"events" : [{', json_string + ) + # Closing `events`' list of dictionaries + json_string = re.sub('}, "links"', '}], "links"', json_string) + # Last 2 symbols are not needed + json_string = json_string[:-2] + # Since now `json_string` is a string, which represents dictionaries, + # we put it into one dictionary, so that `json` could read it as one. 
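# For orientation, after the transformations above `json_string` is expected to
# be a comma-separated sequence of span dictionaries, each roughly of the form
# sketched here (the field names match how the spans are read later in this
# file; the concrete values are illustrative only, not taken from a real trace):
#
#   {
#     "name": "compute",
#     "trace_id": "...",
#     "span_id": "...",
#     "parent_span_id": "0000000000000000",
#     "span kind": "...",
#     "events": "... COMPUTE_START ... COMPUTE_END ...",
#     "resources": {"test.key": "test.value", "service.name": "test_triton"},
#     "instr-lib": "triton-server"
#   }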
+ json_string = '{ "spans" :[' + json_string + "] }" + self.spans = json.loads(json_string)["spans"] + + self.simple_model_name = "simple" + self.ensemble_model_name = "ensemble_add_sub_int32_int32_int32" + self.bls_model_name = "bls_simple" + self.root_span = "InferRequest" + + def _check_events(self, span_name, events): + root_events_http = [ + "HTTP_RECV_START", + "HTTP_RECV_END", + "INFER_RESPONSE_COMPLETE", + "HTTP_SEND_START", + "HTTP_SEND_END", + ] + root_events_grpc = [ + "GRPC_WAITREAD_START", + "GRPC_WAITREAD_END", + "INFER_RESPONSE_COMPLETE", + "GRPC_SEND_START", + "GRPC_SEND_END", + ] + request_events = ["REQUEST_START", "QUEUE_START", "REQUEST_END"] + compute_events = [ + "COMPUTE_START", + "COMPUTE_INPUT_END", + "COMPUTE_OUTPUT_START", + "COMPUTE_END", + ] + + if span_name == "compute": + # Check that all compute related events (and only them) + # are recorded in compute span + self.assertTrue(all(entry in events for entry in compute_events)) + self.assertFalse(all(entry in events for entry in request_events)) + self.assertFalse( + all(entry in events for entry in root_events_http + root_events_grpc) + ) + + elif span_name == self.root_span: + # Check that root span has INFER_RESPONSE_COMPLETE, _RECV/_WAITREAD + # and _SEND events (and only them) + if "HTTP" in events: + self.assertTrue(all(entry in events for entry in root_events_http)) + self.assertFalse(all(entry in events for entry in root_events_grpc)) + + elif "GRPC" in events: + self.assertTrue(all(entry in events for entry in root_events_grpc)) + self.assertFalse(all(entry in events for entry in root_events_http)) + self.assertFalse(all(entry in events for entry in request_events)) + self.assertFalse(all(entry in events for entry in compute_events)) + + elif span_name == self.simple_model_name: + # Check that all request related events (and only them) + # are recorded in request span + self.assertTrue(all(entry in events for entry in request_events)) + self.assertFalse( + all(entry in events for entry in root_events_http + root_events_grpc) + ) + self.assertFalse(all(entry in events for entry in compute_events)) + + def _check_parent(self, child_span, parent_span): + # Check that child and parent span have the same trace_id + # and child's `parent_span_id` is the same as parent's `span_id` + self.assertEqual(child_span["trace_id"], parent_span["trace_id"]) + self.assertNotEqual( + child_span["parent_span_id"], + NO_PARENT_SPAN, + "child span does not have parent span id specified", + ) + self.assertEqual( + child_span["parent_span_id"], + parent_span["span_id"], + "child {} , parent {}".format(child_span, parent_span), + ) + + def test_spans(self): + parsed_spans = [] + + # Check that collected spans have proper events recorded + for span in self.spans: + span_name = span["name"] + self._check_events(span_name, str(span["events"])) + parsed_spans.append(span_name) + + # There should be 16 spans in total: + # 3 for http request, 3 for grpc request, 4 for ensemble, 6 for bls + self.assertEqual(len(self.spans), EXPECTED_NUM_SPANS) + # We should have 5 compute spans + self.assertEqual(parsed_spans.count("compute"), 5) + # 7 request spans + # (4 named simple - same as our model name, 2 ensemble, 1 bls) + self.assertEqual(parsed_spans.count(self.simple_model_name), 4) + self.assertEqual(parsed_spans.count(self.ensemble_model_name), 2) + self.assertEqual(parsed_spans.count(self.bls_model_name), 1) + # 4 root spans + self.assertEqual(parsed_spans.count(self.root_span), 4) + + def test_nested_spans(self): + # First 3 spans in 
`self.spans` belong to HTTP request + # They are recorded in the following order: + # compute_span [idx 0] , request_span [idx 1], root_span [idx 2]. + # compute_span should be a child of request_span + # request_span should be a child of root_span + for child, parent in zip(self.spans[:3], self.spans[1:3]): + self._check_parent(child, parent) + + # Next 3 spans in `self.spans` belong to GRPC request + # Order of spans and their relationship described earlier + for child, parent in zip(self.spans[3:6], self.spans[4:6]): + self._check_parent(child, parent) + + # Next 4 spans in `self.spans` belong to ensemble request + # Order of spans: compute span - request span - request span - root span + for child, parent in zip(self.spans[6:10], self.spans[7:10]): + self._check_parent(child, parent) + + # Final 6 spans in `self.spans` belong to bls with ensemble request + # Order of spans: + # compute span - request span (simple) - request span (ensemble)- + # - compute (for bls) - request (bls) - root span + # request span (ensemble) and compute (for bls) are children of + # request (bls) + children = self.spans[10:] + parents = (self.spans[11:13], self.spans[14], self.spans[14:]) + for child, parent in zip(children, parents[0]): + self._check_parent(child, parent) + + def test_resource_attributes(self): + for span in self.spans: + self.assertIn("test.key", span["resources"]) + self.assertEqual("test.value", span["resources"]["test.key"]) + self.assertIn("service.name", span["resources"]) + self.assertEqual("test_triton", span["resources"]["service.name"]) + + +def prepare_data(client): + inputs = [] + input0_data = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32) + input1_data = np.full(shape=(1, 16), fill_value=-1, dtype=np.int32) + + inputs.append(client.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(client.InferInput("INPUT1", [1, 16], "INT32")) + + # Initialize the data + inputs[0].set_data_from_numpy(input0_data) + inputs[1].set_data_from_numpy(input1_data) + + return inputs + + +def prepare_traces(): + triton_client_http = httpclient.InferenceServerClient( + "localhost:8000", verbose=True + ) + triton_client_grpc = grpcclient.InferenceServerClient( + "localhost:8001", verbose=True + ) + inputs = prepare_data(httpclient) + triton_client_http.infer("simple", inputs) + + inputs = prepare_data(grpcclient) + triton_client_grpc.infer("simple", inputs) + + inputs = prepare_data(httpclient) + triton_client_http.infer("ensemble_add_sub_int32_int32_int32", inputs) + + send_bls_request(model_name="ensemble_add_sub_int32_int32_int32") + + +def send_bls_request(model_name="simple"): + with httpclient.InferenceServerClient("localhost:8000") as client: + inputs = prepare_data(httpclient) + inputs.append(httpclient.InferInput("MODEL_NAME", [1], "BYTES")) + inputs[-1].set_data_from_numpy(np.array([model_name], dtype=np.object_)) + client.infer("bls_simple", inputs) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_trace/test.sh b/qa/L0_trace/test.sh index c7130a5645..56f3250b81 100755 --- a/qa/L0_trace/test.sh +++ b/qa/L0_trace/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
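The shell test that follows drives the trace-settings endpoints through curl helper functions (get/update, global and per-model). The same requests can be issued from Python; a sketch using the requests package, assuming only that the server is listening on localhost:8000 as elsewhere in this test:

import requests

# Global trace settings (same endpoint as the get_global_trace_setting helper below).
r = requests.get("http://localhost:8000/v2/trace/setting")
print(r.status_code, r.json())

# Per-model update (same endpoint as the update_trace_setting helper below);
# invalid or out-of-range values are expected to come back as HTTP 400.
r = requests.post(
    "http://localhost:8000/v2/models/simple/trace/setting",
    json={"trace_level": ["TIMESTAMPS"]},
)
print(r.status_code, r.json())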
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -51,6 +51,7 @@ export CUDA_VISIBLE_DEVICES=0 DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository ENSEMBLEDIR=$DATADIR/../qa_ensemble_model_repository/qa_model_repository/ +BLSDIR=../python_models/bls_simple MODELBASE=onnx_int32_int32_int32 MODELSDIR=`pwd`/trace_models @@ -62,19 +63,94 @@ rm -f *.log rm -fr $MODELSDIR && mkdir -p $MODELSDIR # set up simple and global_simple model using MODELBASE -rm -fr $MODELSDIR && mkdir -p $MODELSDIR && \ - cp -r $DATADIR/$MODELBASE $MODELSDIR/simple && \ +cp -r $DATADIR/$MODELBASE $MODELSDIR/simple && \ rm -r $MODELSDIR/simple/2 && rm -r $MODELSDIR/simple/3 && \ (cd $MODELSDIR/simple && \ sed -i "s/^name:.*/name: \"simple\"/" config.pbtxt) && \ cp -r $MODELSDIR/simple $MODELSDIR/global_simple && \ (cd $MODELSDIR/global_simple && \ sed -i "s/^name:.*/name: \"global_simple\"/" config.pbtxt) && \ + cp -r $ENSEMBLEDIR/simple_onnx_int32_int32_int32 $MODELSDIR/ensemble_add_sub_int32_int32_int32 && \ + rm -r $MODELSDIR/ensemble_add_sub_int32_int32_int32/2 && \ + rm -r $MODELSDIR/ensemble_add_sub_int32_int32_int32/3 && \ + (cd $MODELSDIR/ensemble_add_sub_int32_int32_int32 && \ + sed -i "s/^name:.*/name: \"ensemble_add_sub_int32_int32_int32\"/" config.pbtxt && \ + sed -i "s/model_name:.*/model_name: \"simple\"/" config.pbtxt) && \ + mkdir -p $MODELSDIR/bls_simple/1 && cp $BLSDIR/bls_simple.py $MODELSDIR/bls_simple/1/model.py RET=0 +# Helpers ======================================= +function assert_curl_success { + message="${1}" + if [ "$code" != "200" ]; then + cat ./curl.out + echo -e "\n***\n*** ${message} : line ${BASH_LINENO}\n***" + RET=1 + fi +} + +function assert_curl_failure { + message="${1}" + if [ "$code" != "400" ]; then + cat ./curl.out + echo -e "\n***\n*** ${message} : line ${BASH_LINENO}\n***" + RET=1 + fi +} + +function get_global_trace_setting { + rm -f ./curl.out + set +e + code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/trace/setting` + set -e +} + +function get_trace_setting { + model_name="${1}" + rm -f ./curl.out + set +e + code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/models/${model_name}/trace/setting` + set -e +} + +function update_global_trace_setting { + settings="${1}" + rm -f ./curl.out + set +e + code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/trace/setting -d ${settings}` + set -e +} + +function update_trace_setting { + model_name="${1}" + settings="${2}" + rm -f ./curl.out + set +e + code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/models/${model_name}/trace/setting -d ${settings}` + set -e +} + +function send_inference_requests { + log_file="${1}" + upper_bound="${2}" + for (( p = 1; p <= $upper_bound; p++ )) do + $SIMPLE_HTTP_CLIENT >> ${log_file} 2>&1 + if [ $? -ne 0 ]; then + RET=1 + fi + + $SIMPLE_GRPC_CLIENT >> ${log_file} 2>&1 + if [ $? -ne 0 ]; then + RET=1 + fi + done +} + +#======================================= + # start with trace-level=OFF -SERVER_ARGS="--trace-file=trace_off_to_min.log --trace-level=OFF --trace-rate=1 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=trace_off_to_min.log --trace-config level=OFF --trace-config rate=1 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_off.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -85,28 +161,10 @@ fi set +e -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_off.log 2>&1 - if [ $? 
-ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_off.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done - # Enable via trace API and send again -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_level":["TIMESTAMPS"]}' localhost:8000/v2/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_global_trace_setting '{"trace_level":["TIMESTAMPS"]}' +assert_curl_success "Failed to modify global trace settings" + # Check if the current setting is returned if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 @@ -121,17 +179,7 @@ if [ `grep -c "\"trace_file\":\"trace_off_to_min.log\"" ./curl.out` != "1" ]; th RET=1 fi -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_min.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_min.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_min.log" 10 set -e @@ -140,7 +188,7 @@ wait $SERVER_PID set +e -# Expect only the requests after calling trace API are traced +# Expect only the requests after calling trace API are traced $TRACE_SUMMARY -t trace_off_to_min.log > summary_off_to_min.log if [ `grep -c "COMPUTE_INPUT_END" summary_off_to_min.log` != "20" ]; then @@ -158,7 +206,7 @@ fi set -e # Add model specific setting -SERVER_ARGS="--trace-file=global_trace.log --trace-level=TIMESTAMPS --trace-rate=6 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=global_trace.log --trace-config level=TIMESTAMPS --trace-config rate=6 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_off.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -170,16 +218,10 @@ fi set +e # Add trace setting for 'simple' via trace API, first use the same trace file -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":"global_trace.log"}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi -# Check if the current setting is returned (not specified setting from global) +update_trace_setting "simple" '{"trace_file":"global_trace.log"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" + +# Check if the current setting is returned (not specified setting from global) if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 fi @@ -194,17 +236,10 @@ if [ `grep -c "\"trace_file\":\"global_trace.log\"" ./curl.out` != "1" ]; then fi # Use a different name -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":"simple_trace.log","log_frequency":"2"}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_trace_setting "simple" '{"trace_file":"simple_trace.log","log_frequency":"2"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" -# Check if the current setting is returned (not specified setting from global) +# Check if the current setting is returned (not specified setting from global) if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 fi @@ -221,17 +256,7 @@ if [ `grep -c "\"trace_file\":\"simple_trace.log\"" ./curl.out` != "1" ]; then RET=1 fi -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_simple.log 2>&1 - if [ $? 
-ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_simple.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_simple.log" 10 set -e @@ -276,7 +301,7 @@ fi set -e # Update and clear model specific setting -SERVER_ARGS="--trace-file=global_trace.log --trace-level=TIMESTAMPS --trace-rate=6 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=global_trace.log --trace-config level=TIMESTAMPS --trace-config rate=6 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_off.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -288,25 +313,11 @@ fi set +e # Add model setting and update it -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":"update_trace.log", "trace_rate":"1"}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_trace_setting "simple" '{"trace_file":"update_trace.log","trace_rate":"1"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":"update_trace.log", "trace_level":["OFF"]}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_trace_setting "simple" '{"trace_file":"update_trace.log","trace_level":["OFF"]}' +assert_curl_success "Failed to modify trace settings for 'simple' model" # Check if the current setting is returned if [ `grep -c "\"trace_level\":\[\"OFF\"\]" ./curl.out` != "1" ]; then @@ -326,31 +337,14 @@ if [ `grep -c "\"trace_file\":\"update_trace.log\"" ./curl.out` != "1" ]; then fi # Send requests to simple where trace is explicitly disabled -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_update.log" 10 rm -f ./curl.out set +e -# Clear trace setting by explicitly asking removal for every feild except 'trace_rate' -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_file":null, "trace_level":null}' localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +# Clear trace setting by explicitly asking removal for every field except 'trace_rate' +update_trace_setting "simple" '{"trace_file":null,"trace_level":null}' +assert_curl_success "Failed to modify trace settings for 'simple' model" # Check if the current setting (global) is returned if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then @@ -370,17 +364,7 @@ if [ `grep -c "\"trace_file\":\"global_trace.log\"" ./curl.out` != "1" ]; then fi # Send requests to simple where now uses global setting -for p in {1..5}; do - $SIMPLE_HTTP_CLIENT >> client_clear.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_clear.log 2>&1 - if [ $? 
-ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_clear.log" 5 set -e @@ -411,7 +395,7 @@ fi set -e # Update trace count -SERVER_ARGS="--trace-file=global_count.log --trace-level=TIMESTAMPS --trace-rate=1 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=global_count.log --trace-config level=TIMESTAMPS --trace-config rate=1 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_off.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -423,30 +407,14 @@ fi set +e # Send requests without trace count -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi - - $SIMPLE_GRPC_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +send_inference_requests "client_update.log" 10 set -e # Check the current setting -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" + if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 fi @@ -464,15 +432,8 @@ if [ `grep -c "\"trace_file\":\"global_count.log\"" ./curl.out` != "1" ]; then fi # Set trace count -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out -d'{"trace_count":"5"}' localhost:8000/v2/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi +update_global_trace_setting '{"trace_count":"5"}' +assert_curl_success "Failed to modify global trace settings" # Check if the current setting is returned if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then @@ -492,28 +453,12 @@ if [ `grep -c "\"trace_file\":\"global_count.log\"" ./curl.out` != "1" ]; then fi # Send requests to simple where trace is explicitly disabled -for p in {1..10}; do - $SIMPLE_HTTP_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi +send_inference_requests "client_update.log" 10 - $SIMPLE_GRPC_CLIENT >> client_update.log 2>&1 - if [ $? -ne 0 ]; then - RET=1 - fi -done +# Check the current setting again and expect 'trace_count' becomes 0 +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" -# Check the current setting agian and expect 'trace_count' becomes 0 -rm -f ./curl.out -set +e -code=`curl -s -w %{http_code} -o ./curl.out localhost:8000/v2/models/simple/trace/setting` -set -e -if [ "$code" != "200" ]; then - cat ./curl.out - echo -e "\n***\n*** Test Failed\n***" - RET=1 -fi if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then RET=1 fi @@ -536,6 +481,14 @@ if [ -f ./global_trace.log.0 ]; then RET=1 fi +SETTINGS="trace_count trace_rate log_frequency" + +for SETTING in $SETTINGS; do + # Check `out of range` errors + update_trace_setting "simple" '{"'${SETTING}'":"10000000000"}' + assert_curl_failure "Server modified '${SETTING}' with an out of range value." 
+done + set -e kill $SERVER_PID @@ -576,7 +529,7 @@ fi set -e # Test Python client library -SERVER_ARGS="--trace-file=global_unittest.log --trace-level=TIMESTAMPS --trace-rate=1 --model-repository=$MODELSDIR" +SERVER_ARGS="--trace-config triton,file=global_unittest.log --trace-config level=TIMESTAMPS --trace-config rate=1 --model-repository=$MODELSDIR" SERVER_LOG="./inference_server_unittest.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -607,11 +560,249 @@ set -e kill $SERVER_PID wait $SERVER_PID -if [ $RET -eq 0 ]; then - echo -e "\n***\n*** Test Passed\n***" -else - echo -e "\n***\n*** Test FAILED\n***" + +# Check `--trace-config` sets arguments properly +SERVER_ARGS="--trace-config=triton,file=bls_trace.log --trace-config=level=TIMESTAMPS \ + --trace-config=rate=4 --trace-config=count=6 --trace-config=mode=triton --model-repository=$MODELSDIR" +SERVER_LOG="./inference_server_trace_config.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" + +if [ `grep -c "\"trace_level\":\[\"TIMESTAMPS\"\]" ./curl.out` != "1" ]; then + RET=1 +fi +if [ `grep -c "\"trace_rate\":\"4\"" ./curl.out` != "1" ]; then + RET=1 +fi +if [ `grep -c "\"trace_count\":\"6\"" ./curl.out` != "1" ]; then + RET=1 +fi +if [ `grep -c "\"log_frequency\":\"0\"" ./curl.out` != "1" ]; then + RET=1 +fi +if [ `grep -c "\"trace_file\":\"bls_trace.log\"" ./curl.out` != "1" ]; then + RET=1 +fi + +set +e +# Send bls requests to make sure simple model is traced +for p in {1..4}; do + python -c 'import opentelemetry_unittest; \ + opentelemetry_unittest.send_bls_request(model_name="ensemble_add_sub_int32_int32_int32")' >> client_update.log 2>&1 +done + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +set +e + +$TRACE_SUMMARY -t bls_trace.log > summary_bls.log + +if [ `grep -c "COMPUTE_INPUT_END" summary_bls.log` != "2" ]; then + cat summary_bls.log + echo -e "\n***\n*** Test Failed: Unexpected number of traced "COMPUTE_INPUT_END" events.\n***" + RET=1 +fi + +if [ `grep -c ^ensemble_add_sub_int32_int32_int32 summary_bls.log` != "1" ]; then + cat summary_bls.log + echo -e "\n***\n*** Test Failed: BLS child ensemble model wasn't traced. \n***" + RET=1 +fi + +if [ `grep -c ^simple summary_bls.log` != "1" ]; then + cat summary_bls.log + echo -e "\n***\n*** Test Failed: ensemble's model 'simple' wasn't traced. \n***" + RET=1 +fi + +if [ `grep -o 'parent_id' bls_trace.log | wc -l` != "2" ]; then + cat bls_trace.log + echo -e "\n***\n*** Test Failed: Unexpected number of 'parent id' fields. 
\n***" + RET=1 +fi + +# Attempt to trace non-existent model +SERVER_ARGS="--model-control-mode=explicit --model-repository=$MODELSDIR" +SERVER_LOG="./inference_server_nonexistent_model.log" +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +# Explicitly load model +rm -f ./curl.out +set +e +code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/repository/models/simple/load` +set -e +assert_curl_success "Failed to load 'simple' model" + +# Non-existent model (get) +get_trace_setting "does-not-exist" +assert_curl_failure "Server returned trace settings for a non-existent model" + +# Non-existent model (post) +update_trace_setting "does-not-exist" '{"log_frequency":"1"}' +assert_curl_failure "Server modified trace settings for a non-existent model" + +# Local model (get) +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" + +# Local model (post) +update_trace_setting "simple" '{"log_frequency":"1"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" + +# Local model (unload) +rm -f ./curl.out +set +e +code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/repository/models/simple/unload` +set -e +assert_curl_success "Failed to unload 'simple' model" + +get_trace_setting "simple" +assert_curl_failure "Server returned trace settings for an unloaded model" + +update_trace_setting "simple" '{"log_frequency":"1"}' +assert_curl_failure "Server modified trace settings for an unloaded model" + +# Local model (reload) +rm -f ./curl.out +set +e +code=`curl -s -w %{http_code} -o ./curl.out -X POST localhost:8000/v2/repository/models/simple/load` +set -e +assert_curl_success "Failed to load 'simple' model" + +get_trace_setting "simple" +assert_curl_success "Failed to obtain trace settings for 'simple' model" + +update_trace_setting "simple" '{"log_frequency":"1"}' +assert_curl_success "Failed to modify trace settings for 'simple' model" + +kill $SERVER_PID +wait $SERVER_PID + +set +e + +# Check opentelemetry trace exporter sends proper info. +# A helper python script starts listening on $OTLP_PORT, where +# OTLP exporter sends traces. +export TRITON_OPENTELEMETRY_TEST='false' +OTLP_PORT=10000 +OTEL_COLLECTOR_DIR=./opentelemetry-collector +OTEL_COLLECTOR=./opentelemetry-collector/bin/otelcorecol_* +OTEL_COLLECTOR_LOG="./trace_collector_http_exporter.log" + +# Building the latest version of the OpenTelemetry collector. +# Ref: https://opentelemetry.io/docs/collector/getting-started/#local +if [ -d "$OTEL_COLLECTOR_DIR" ]; then rm -Rf $OTEL_COLLECTOR_DIR; fi +git clone --depth 1 --branch v0.82.0 https://github.com/open-telemetry/opentelemetry-collector.git +cd $OTEL_COLLECTOR_DIR +make install-tools +make otelcorecol +cd .. +$OTEL_COLLECTOR --config ./trace-config.yaml >> $OTEL_COLLECTOR_LOG 2>&1 & COLLECTOR_PID=$! + + +SERVER_ARGS="--trace-config=level=TIMESTAMPS --trace-config=rate=1 \ + --trace-config=count=100 --trace-config=mode=opentelemetry \ + --trace-config=opentelemetry,url=localhost:$OTLP_PORT/v1/traces \ + --model-repository=$MODELSDIR" +SERVER_LOG="./inference_server_otel_http_exporter.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +$SIMPLE_HTTP_CLIENT >>$CLIENT_LOG 2>&1 + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +kill $COLLECTOR_PID +wait $COLLECTOR_PID + +set +e + +if ! 
[[ -s $OTEL_COLLECTOR_LOG && `grep -c 'InstrumentationScope triton-server' $OTEL_COLLECTOR_LOG` == 3 ]] ; then + echo -e "\n***\n*** HTTP exporter test failed.\n***" + cat $OTEL_COLLECTOR_LOG + exit 1 fi +# Unittests then check that produced spans have expected format and events +OPENTELEMETRY_TEST=opentelemetry_unittest.py +OPENTELEMETRY_LOG="opentelemetry_unittest.log" +EXPECTED_NUM_TESTS="3" + +export TRITON_OPENTELEMETRY_TEST='true' + +SERVER_ARGS="--trace-config=level=TIMESTAMPS --trace-config=rate=1 \ + --trace-config=count=100 --trace-config=mode=opentelemetry \ + --trace-config=opentelemetry,resource=test.key=test.value \ + --trace-config=opentelemetry,resource=service.name=test_triton \ + --model-repository=$MODELSDIR" +SERVER_LOG="./inference_server_otel_ostream_exporter.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +# Preparing traces for unittest. +# Note: running this separately, so that I could extract spans with `grep` +# from server log later. +python -c 'import opentelemetry_unittest; \ + opentelemetry_unittest.prepare_traces()' >>$CLIENT_LOG 2>&1 + +sleep 5 + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +set +e + +grep -z -o -P '({\n(?s).*}\n)' $SERVER_LOG >> trace_collector.log + +if ! [ -s trace_collector.log ] ; then + echo -e "\n***\n*** $SERVER_LOG did not contain any OpenTelemetry spans.\n***" + exit 1 +fi + +# Unittest will not start until expected number of spans is collected. +python $OPENTELEMETRY_TEST >>$OPENTELEMETRY_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $OPENTELEMETRY_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $OPENTELEMETRY_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + exit $RET diff --git a/qa/L0_trace/trace-config.yaml b/qa/L0_trace/trace-config.yaml new file mode 100644 index 0000000000..f8fe2424c0 --- /dev/null +++ b/qa/L0_trace/trace-config.yaml @@ -0,0 +1,45 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# Simple config file for OpenTelemetry collector. +# It receives all traces, received on localhost:10000 and prints +# it into the output stream. +# Ref: https://opentelemetry.io/docs/collector/configuration/ +receivers: + otlp: + protocols: + http: + endpoint: 0.0.0.0:10000 + +exporters: + logging: + verbosity: detailed + +service: + pipelines: + traces: + receivers: [otlp] + exporters: [logging] diff --git a/qa/L0_trace/trace_endpoint_test.py b/qa/L0_trace/trace_endpoint_test.py old mode 100644 new mode 100755 index c836e03e8f..70066dd3b2 --- a/qa/L0_trace/trace_endpoint_test.py +++ b/qa/L0_trace/trace_endpoint_test.py @@ -1,6 +1,6 @@ #!/usr/bin/python -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -27,21 +27,21 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -import numpy as np +import json import sys import unittest -import tritonclient.http as httpclient + +import test_util as tu import tritonclient.grpc as grpcclient -import json +import tritonclient.http as httpclient from google.protobuf import json_format -import test_util as tu # Similar set up as dynamic batcher tests class TraceEndpointTest(tu.TestResultCollector): - def tearDown(self): # Clear all trace settings to initial state. 
# Note that the tearDown function uses HTTP client so the pass/fail @@ -53,13 +53,13 @@ def tearDown(self): "trace_level": None, "trace_rate": None, "trace_count": None, - "log_frequency": None + "log_frequency": None, } triton_client = httpclient.InferenceServerClient("localhost:8000") - triton_client.update_trace_settings(model_name="simple", - settings=clear_settings) - triton_client.update_trace_settings(model_name=None, - settings=clear_settings) + triton_client.update_trace_settings( + model_name="simple", settings=clear_settings + ) + triton_client.update_trace_settings(model_name=None, settings=clear_settings) def check_server_initial_state(self): # Helper function to make sure the trace setting is properly @@ -72,11 +72,12 @@ def check_server_initial_state(self): "trace_level": ["TIMESTAMPS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } triton_client = httpclient.InferenceServerClient("localhost:8000") - self.assertEqual(initial_settings, - triton_client.get_trace_settings(model_name="simple")) + self.assertEqual( + initial_settings, triton_client.get_trace_settings(model_name="simple") + ) self.assertEqual(initial_settings, triton_client.get_trace_settings()) def test_http_get_settings(self): @@ -87,46 +88,64 @@ def test_http_get_settings(self): "trace_level": ["TIMESTAMPS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } triton_client = httpclient.InferenceServerClient("localhost:8000") - self.assertEqual(initial_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected initial model trace settings") - self.assertEqual(initial_settings, triton_client.get_trace_settings(), - "Unexpected initial global settings") + self.assertEqual( + initial_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected initial model trace settings", + ) + self.assertEqual( + initial_settings, + triton_client.get_trace_settings(), + "Unexpected initial global settings", + ) + try: + triton_client.get_trace_settings(model_name="does-not-exist") + except Exception as ex: + self.assertIn( + "Request for unknown model : does-not-exist", + ex.message(), + ) def test_grpc_get_settings(self): # Model trace settings will be the same as global trace settings since # no update has been made. 
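For context on the calls these endpoint tests exercise, here is a minimal sketch of querying trace settings over both protocols, assuming a local server with the `simple` model loaded on the default ports (8000 for HTTP, 8001 for gRPC). The HTTP client returns plain dictionaries while the gRPC client returns a `TraceSettingResponse` protobuf, which is why the expectations above are built differently for the two clients.

import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient

# HTTP: trace settings come back as a plain dict, e.g. {"trace_rate": "1", ...}
http_client = httpclient.InferenceServerClient("localhost:8000")
print(http_client.get_trace_settings())                     # global settings
print(http_client.get_trace_settings(model_name="simple"))  # per-model settings

# gRPC: trace settings come back as a TraceSettingResponse protobuf message
grpc_client = grpcclient.InferenceServerClient("localhost:8001")
print(grpc_client.get_trace_settings(model_name="simple"))

# Asking for a model that is not loaded raises an exception whose message
# contains "Request for unknown model", which is what the new checks assert.
try:
    http_client.get_trace_settings(model_name="does-not-exist")
except Exception as ex:
    print(ex)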
initial_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["global_unittest.log"] - }, - "trace_level": { - "value": ["TIMESTAMPS"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["global_unittest.log"]}, + "trace_level": {"value": ["TIMESTAMPS"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), initial_settings) + ), + initial_settings, + ) triton_client = grpcclient.InferenceServerClient("localhost:8001") - self.assertEqual(initial_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected initial model trace settings") - self.assertEqual(initial_settings, triton_client.get_trace_settings(), - "Unexpected initial global settings") + self.assertEqual( + initial_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected initial model trace settings", + ) + self.assertEqual( + initial_settings, + triton_client.get_trace_settings(), + "Unexpected initial global settings", + ) + try: + triton_client.get_trace_settings(model_name="does-not-exist") + except Exception as ex: + self.assertIn( + "Request for unknown model : does-not-exist", + ex.message(), + ) def test_http_update_settings(self): # Update model and global trace settings in order, @@ -139,47 +158,60 @@ def test_http_update_settings(self): "trace_level": ["TIMESTAMPS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } expected_second_model_settings = { "trace_file": "model.log", "trace_level": ["TIMESTAMPS", "TENSORS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } expected_global_settings = { "trace_file": "another.log", "trace_level": ["TIMESTAMPS", "TENSORS"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } model_update_settings = {"trace_file": "model.log"} global_update_settings = { "trace_file": "another.log", - "trace_level": ["TIMESTAMPS", "TENSORS"] + "trace_level": ["TIMESTAMPS", "TENSORS"], } triton_client = httpclient.InferenceServerClient("localhost:8000") self.assertEqual( expected_first_model_settings, - triton_client.update_trace_settings(model_name="simple", - settings=model_update_settings), - "Unexpected updated model trace settings") + triton_client.update_trace_settings( + model_name="simple", settings=model_update_settings + ), + "Unexpected updated model trace settings", + ) # Note that 'trace_level' may be mismatch due to the order of # the levels listed, currently we assume the order is the same # for simplicity. 
But the order shouldn't be enforced and this checking # needs to be improved when this kind of failure is reported self.assertEqual( expected_global_settings, + triton_client.update_trace_settings(settings=global_update_settings), + "Unexpected updated global settings", + ) + self.assertEqual( + expected_second_model_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected model trace settings after global update", + ) + try: triton_client.update_trace_settings( - settings=global_update_settings), - "Unexpected updated global settings") - self.assertEqual(expected_second_model_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected model trace settings after global update") + model_name="does-not-exist", settings=model_update_settings + ) + except Exception as ex: + self.assertIn( + "Request for unknown model : does-not-exist", + ex.message(), + ) def test_grpc_update_settings(self): # Update model and global trace settings in order, @@ -187,98 +219,91 @@ def test_grpc_update_settings(self): # the model setting fields that haven't been specified. self.check_server_initial_state() - expected_first_model_settings = grpcclient.service_pb2.TraceSettingResponse( - ) + expected_first_model_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["model.log"] - }, - "trace_level": { - "value": ["TIMESTAMPS"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["model.log"]}, + "trace_level": {"value": ["TIMESTAMPS"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), expected_first_model_settings) - - expected_second_model_settings = grpcclient.service_pb2.TraceSettingResponse( + ), + expected_first_model_settings, ) + + expected_second_model_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["model.log"] - }, - "trace_level": { - "value": ["TIMESTAMPS", "TENSORS"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["model.log"]}, + "trace_level": {"value": ["TIMESTAMPS", "TENSORS"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), expected_second_model_settings) + ), + expected_second_model_settings, + ) expected_global_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["another.log"] - }, - "trace_level": { - "value": ["TIMESTAMPS", "TENSORS"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["another.log"]}, + "trace_level": {"value": ["TIMESTAMPS", "TENSORS"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), expected_global_settings) + ), + expected_global_settings, + ) model_update_settings = {"trace_file": "model.log"} global_update_settings = { "trace_file": "another.log", - "trace_level": ["TIMESTAMPS", "TENSORS"] + "trace_level": ["TIMESTAMPS", "TENSORS"], } triton_client = 
grpcclient.InferenceServerClient("localhost:8001") self.assertEqual( expected_first_model_settings, - triton_client.update_trace_settings(model_name="simple", - settings=model_update_settings), - "Unexpected updated model trace settings") + triton_client.update_trace_settings( + model_name="simple", settings=model_update_settings + ), + "Unexpected updated model trace settings", + ) # Note that 'trace_level' may be mismatch due to the order of # the levels listed, currently we assume the order is the same # for simplicity. But the order shouldn't be enforced and this checking # needs to be improved when this kind of failure is reported self.assertEqual( expected_global_settings, + triton_client.update_trace_settings(settings=global_update_settings), + "Unexpected updated global settings", + ) + self.assertEqual( + expected_second_model_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected model trace settings after global update", + ) + try: triton_client.update_trace_settings( - settings=global_update_settings), - "Unexpected updated global settings") - self.assertEqual(expected_second_model_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected model trace settings after global update") + model_name="does-not-exist", settings=model_update_settings + ) + except Exception as ex: + self.assertIn( + "Request for unknown model : does-not-exist", + ex.message(), + ) def test_http_clear_settings(self): # Clear global and model trace settings in order, @@ -290,37 +315,33 @@ def test_http_clear_settings(self): # model 'simple' has 'trace_rate' and 'log_frequency' specified # global has 'trace_level', 'trace_count' and 'trace_rate' specified triton_client = httpclient.InferenceServerClient("localhost:8000") - triton_client.update_trace_settings(model_name="simple", - settings={ - "trace_rate": "12", - "log_frequency": "34" - }) - triton_client.update_trace_settings(settings={ - "trace_rate": "56", - "trace_count": "78", - "trace_level": ["OFF"] - }) + triton_client.update_trace_settings( + model_name="simple", settings={"trace_rate": "12", "log_frequency": "34"} + ) + triton_client.update_trace_settings( + settings={"trace_rate": "56", "trace_count": "78", "trace_level": ["OFF"]} + ) expected_global_settings = { "trace_file": "global_unittest.log", "trace_level": ["OFF"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "0" + "log_frequency": "0", } expected_first_model_settings = { "trace_file": "global_unittest.log", "trace_level": ["OFF"], "trace_rate": "12", "trace_count": "-1", - "log_frequency": "34" + "log_frequency": "34", } expected_second_model_settings = { "trace_file": "global_unittest.log", "trace_level": ["OFF"], "trace_rate": "1", "trace_count": "-1", - "log_frequency": "34" + "log_frequency": "34", } global_clear_settings = {"trace_rate": None, "trace_count": None} model_clear_settings = {"trace_rate": None, "trace_level": None} @@ -329,18 +350,25 @@ def test_http_clear_settings(self): self.assertEqual( expected_global_settings, triton_client.update_trace_settings(settings=global_clear_settings), - "Unexpected cleared global trace settings") - self.assertEqual(expected_first_model_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected model trace settings after global clear") + "Unexpected cleared global trace settings", + ) + self.assertEqual( + expected_first_model_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected model trace settings after global clear", + ) 
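As a side note on the clear-settings semantics the remaining assertions rely on: passing `None` for a field in `update_trace_settings` clears that field, so a model-level override falls back to the current global value (and a global field falls back to its default). A rough sketch, again assuming a local server with the `simple` model loaded:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# Give the 'simple' model its own trace rate...
client.update_trace_settings(model_name="simple", settings={"trace_rate": "12"})

# ...then clear it: a None value removes the model-level override, so the
# returned settings show "trace_rate" falling back to the global value.
cleared = client.update_trace_settings(
    model_name="simple", settings={"trace_rate": None}
)
print(cleared["trace_rate"])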
self.assertEqual( expected_second_model_settings, - triton_client.update_trace_settings(model_name="simple", - settings=model_clear_settings), - "Unexpected model trace settings after model clear") - self.assertEqual(expected_global_settings, - triton_client.get_trace_settings(), - "Unexpected global trace settings after model clear") + triton_client.update_trace_settings( + model_name="simple", settings=model_clear_settings + ), + "Unexpected model trace settings after model clear", + ) + self.assertEqual( + expected_global_settings, + triton_client.get_trace_settings(), + "Unexpected global trace settings after model clear", + ) def test_grpc_clear_settings(self): # Clear global and model trace settings in order, @@ -352,82 +380,58 @@ def test_grpc_clear_settings(self): # model 'simple' has 'trace_rate' and 'log_frequency' specified # global has 'trace_level', 'trace_count' and 'trace_rate' specified triton_client = grpcclient.InferenceServerClient("localhost:8001") - triton_client.update_trace_settings(model_name="simple", - settings={ - "trace_rate": "12", - "log_frequency": "34" - }) - triton_client.update_trace_settings(settings={ - "trace_rate": "56", - "trace_count": "78", - "trace_level": ["OFF"] - }) + triton_client.update_trace_settings( + model_name="simple", settings={"trace_rate": "12", "log_frequency": "34"} + ) + triton_client.update_trace_settings( + settings={"trace_rate": "56", "trace_count": "78", "trace_level": ["OFF"]} + ) expected_global_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["global_unittest.log"] - }, - "trace_level": { - "value": ["OFF"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["0"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["global_unittest.log"]}, + "trace_level": {"value": ["OFF"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["0"]}, + } } - }), expected_global_settings) - expected_first_model_settings = grpcclient.service_pb2.TraceSettingResponse( + ), + expected_global_settings, ) + expected_first_model_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["global_unittest.log"] - }, - "trace_level": { - "value": ["OFF"] - }, - "trace_rate": { - "value": ["12"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["34"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["global_unittest.log"]}, + "trace_level": {"value": ["OFF"]}, + "trace_rate": {"value": ["12"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["34"]}, + } } - }), expected_first_model_settings) - expected_second_model_settings = grpcclient.service_pb2.TraceSettingResponse( + ), + expected_first_model_settings, ) + expected_second_model_settings = grpcclient.service_pb2.TraceSettingResponse() json_format.Parse( - json.dumps({ - "settings": { - "trace_file": { - "value": ["global_unittest.log"] - }, - "trace_level": { - "value": ["OFF"] - }, - "trace_rate": { - "value": ["1"] - }, - "trace_count": { - "value": ["-1"] - }, - "log_frequency": { - "value": ["34"] - }, + json.dumps( + { + "settings": { + "trace_file": {"value": ["global_unittest.log"]}, + "trace_level": {"value": ["OFF"]}, + "trace_rate": {"value": ["1"]}, + "trace_count": {"value": ["-1"]}, + "log_frequency": {"value": ["34"]}, + } } - }), 
expected_second_model_settings) + ), + expected_second_model_settings, + ) global_clear_settings = {"trace_rate": None, "trace_count": None} model_clear_settings = {"trace_rate": None, "trace_level": None} @@ -436,19 +440,26 @@ def test_grpc_clear_settings(self): self.assertEqual( expected_global_settings, triton_client.update_trace_settings(settings=global_clear_settings), - "Unexpected cleared global trace settings") - self.assertEqual(expected_first_model_settings, - triton_client.get_trace_settings(model_name="simple"), - "Unexpected model trace settings after global clear") + "Unexpected cleared global trace settings", + ) + self.assertEqual( + expected_first_model_settings, + triton_client.get_trace_settings(model_name="simple"), + "Unexpected model trace settings after global clear", + ) self.assertEqual( expected_second_model_settings, - triton_client.update_trace_settings(model_name="simple", - settings=model_clear_settings), - "Unexpected model trace settings after model clear") - self.assertEqual(expected_global_settings, - triton_client.get_trace_settings(), - "Unexpected global trace settings after model clear") + triton_client.update_trace_settings( + model_name="simple", settings=model_clear_settings + ), + "Unexpected model trace settings after model clear", + ) + self.assertEqual( + expected_global_settings, + triton_client.get_trace_settings(), + "Unexpected global trace settings after model clear", + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_triton_repo_agent/test.sh b/qa/L0_triton_repo_agent/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_trt_compat/test.sh b/qa/L0_trt_compat/test.sh new file mode 100755 index 0000000000..6b4f83cbc8 --- /dev/null +++ b/qa/L0_trt_compat/test.sh @@ -0,0 +1,110 @@ +#!/bin/bash +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi + +TEST_RESULT_FILE='test_results.txt' +COMPATIBILITY_TEST_PY=trt_compatibility_test.py +CLIENT_LOG="client.log" +DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=`pwd`/models --exit-timeout-secs=120" +SERVER_LOG="./inference_server.log" +source ../common/util.sh + +rm -fr models && mkdir models +cp -r $DATADIR/qa_identity_model_repository/plan_compatible_zero_1_float32 models/. + +RET=0 + +if [ `ps | grep -c "tritonserver"` != "0" ]; then + echo -e "Tritonserver already running" + echo -e `ps | grep tritonserver` + exit 1 +fi + +run_server +if [ "$SERVER_PID" != "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** FAILED: unexpected server start (version compatibility disabled): $SERVER\n***" >> $CLIENT_LOG + kill $SERVER_PID + wait $SERVER_PID + exit 1 +fi + +EXPECTED_ERR="Internal Error (Cannot deserialize engine with lean runtime" +if ! grep "$EXPECTED_ERR" $SERVER_LOG; then + cat $SERVER_LOG + echo -e "\n***\n*** Failed to find expected error: ${EXPECTED_ERR} \n***" + RET=1 +fi + +SERVER_ARGS="--model-repository=`pwd`/models --exit-timeout-secs=120 --backend-config=tensorrt,version-compatible=true" + +run_server +if [ "$SERVER_PID" == "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** FAILED: unsuccessful server start (version compatibility enabled): $SERVER\n***" + exit 1 +fi + +set +e + +python $COMPATIBILITY_TEST_PY >$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test FAILED\n***" +fi + +exit $RET diff --git a/qa/L0_trt_compat/trt_compatibility_test.py b/qa/L0_trt_compat/trt_compatibility_test.py new file mode 100755 index 0000000000..6991299a4c --- /dev/null +++ b/qa/L0_trt_compat/trt_compatibility_test.py @@ -0,0 +1,50 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import sys + +sys.path.append("../common") + +import unittest + +import infer_util as iu +import numpy as np +import test_util as tu + + +class TrtCompatibilityTest(tu.TestResultCollector): + def setUp(self): + self._data_type = np.float32 + + def test_plan(self): + # plan_compatible_zero_1_float32 is an identity model with input shape [-1] + iu.infer_zero(self, "plan_compatible", 1, self._data_type, [[2, 4]], [[2, 4]]) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_trt_data_dependent_shape/test.sh b/qa/L0_trt_data_dependent_shape/test.sh new file mode 100755 index 0000000000..61efb053f8 --- /dev/null +++ b/qa/L0_trt_data_dependent_shape/test.sh @@ -0,0 +1,94 @@ +#!/bin/bash +# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +REPO_VERSION=${NVIDIA_TRITON_SERVER_VERSION} +if [ "$#" -ge 1 ]; then + REPO_VERSION=$1 +fi +if [ -z "$REPO_VERSION" ]; then + echo -e "Repository version must be specified" + echo -e "\n***\n*** Test Failed\n***" + exit 1 +fi +if [ ! 
-z "$TEST_REPO_ARCH" ]; then + REPO_VERSION=${REPO_VERSION}_${TEST_REPO_ARCH} +fi + +TEST_RESULT_FILE='test_results.txt' +export CUDA_VISIBLE_DEVICES=0 + +TRT_TEST=trt_data_dependent_shape_test.py + +DATADIR="./models" + +rm -rf ${DATADIR} +cp -r /data/inferenceserver/${REPO_VERSION}/qa_trt_data_dependent_model_repository/ ${DATADIR} + +source ../common/util.sh + +rm -f *.log* + +RET=0 + +CLIENT_LOG="./client.log" +SERVER_LOG="./inference_server.log" +SERVER=/opt/tritonserver/bin/tritonserver +SERVER_ARGS="--model-repository=$DATADIR" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e +python $TRT_TEST >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** Test Failed\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE 2 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill $SERVER_PID +wait $SERVER_PID + +if [ $RET -eq 0 ]; then + echo -e "\n***\n*** Test Passed\n***" +else + echo -e "\n***\n*** Test Failed\n***" +fi + +exit $RET diff --git a/qa/L0_trt_data_dependent_shape/trt_data_dependent_shape_test.py b/qa/L0_trt_data_dependent_shape/trt_data_dependent_shape_test.py new file mode 100755 index 0000000000..ee0b675d84 --- /dev/null +++ b/qa/L0_trt_data_dependent_shape/trt_data_dependent_shape_test.py @@ -0,0 +1,85 @@ +#!/usr/bin/env python3 + +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
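The L0_trt_data_dependent_shape run above feeds models whose output shape depends on the input values, not just the input shape, which is what the unit test below verifies against `np.nonzero`. A small numpy-only illustration of that property (no Triton involved):

import numpy as np

# Two inputs with the same shape produce differently shaped nonzero results,
# because the number of nonzero elements depends on the values themselves.
a = np.array([[1, 0], [0, 2]], dtype=np.int32)
b = np.array([[1, 3], [5, 2]], dtype=np.int32)

print(np.nonzero(a))  # index arrays for 2 nonzero elements
print(np.nonzero(b))  # index arrays for 4 nonzero elements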
+ +import sys + +sys.path.append("../common") + +import unittest + +import numpy as np +import test_util as tu +import tritonclient.http as client + + +class TrtDataDependentShapeTest(tu.TestResultCollector): + def setUp(self): + self.triton_client = client.InferenceServerClient( + "localhost:8000", verbose=True + ) + + def test_fixed(self): + model_name = "plan_nobatch_nonzero_fixed" + input_np = np.arange(16, dtype=np.int32).reshape((4, 4)) + expected_output_np = np.nonzero(input_np) + + inputs = [] + inputs.append(client.InferInput("INPUT", [4, 4], "INT32")) + inputs[-1].set_data_from_numpy(input_np) + + results = self.triton_client.infer(model_name=model_name, inputs=inputs) + # Validate the results by comparing with precomputed values. + output_np = results.as_numpy("OUTPUT") + self.assertTrue( + np.array_equal(output_np, expected_output_np), + "OUTPUT expected: {}, got {}".format(expected_output_np, output_np), + ) + + def test_dynamic(self): + model_name = "plan_nobatch_nonzero_dynamic" + input_data = [] + for i in range(20 * 16): + input_data.append(i if (i % 2) == 0 else 0) + input_np = np.array(input_data, dtype=np.int32).reshape((20, 16)) + expected_output_np = np.nonzero(input_np) + + inputs = [] + inputs.append(client.InferInput("INPUT", [20, 16], "INT32")) + inputs[-1].set_data_from_numpy(input_np) + + results = self.triton_client.infer(model_name=model_name, inputs=inputs) + # Validate the results by comparing with precomputed values. + output_np = results.as_numpy("OUTPUT") + self.assertTrue( + np.array_equal(output_np, expected_output_np), + "OUTPUT expected: {}, got {}".format(expected_output_np, output_np), + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_trt_dla/dla_test.py b/qa/L0_trt_dla/dla_test.py old mode 100644 new mode 100755 index ec4f687c47..d71d277ac4 --- a/qa/L0_trt_dla/dla_test.py +++ b/qa/L0_trt_dla/dla_test.py @@ -1,5 +1,5 @@ #!/usr/bin/env python -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,26 +26,25 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") import unittest + import numpy as np -from PIL import Image import test_util as tu - import tritonclient.http as httpclient -from tritonclient.utils import InferenceServerException +from PIL import Image class InferTest(tu.TestResultCollector): - def _preprocess(self, img, dtype): """ Pre-process an image to meet the size and type requirements specified by the parameters. 
""" - sample_img = img.convert('RGB') + sample_img = img.convert("RGB") resized_img = sample_img.resize((224, 224), Image.BILINEAR) resized = np.array(resized_img) @@ -57,8 +56,7 @@ def _preprocess(self, img, dtype): def test_resnet50(self): try: - triton_client = httpclient.InferenceServerClient( - url="localhost:8000") + triton_client = httpclient.InferenceServerClient(url="localhost:8000") except Exception as e: print("channel creation failed: " + str(e)) sys.exit(1) @@ -74,22 +72,21 @@ def test_resnet50(self): batched_image_data = image_data for i in range(1, batch_size): batched_image_data = np.concatenate( - (batched_image_data, image_data), axis=0) + (batched_image_data, image_data), axis=0 + ) inputs = [ - httpclient.InferInput('input_tensor_0', [batch_size, 3, 224, 224], - 'INT8') + httpclient.InferInput("input_tensor_0", [batch_size, 3, 224, 224], "INT8") ] inputs[0].set_data_from_numpy(batched_image_data, binary_data=True) outputs = [ - httpclient.InferRequestedOutput('topk_layer_output_index', - binary_data=True) + httpclient.InferRequestedOutput("topk_layer_output_index", binary_data=True) ] results = triton_client.infer(model_name, inputs, outputs=outputs) - output_data = results.as_numpy('topk_layer_output_index') + output_data = results.as_numpy("topk_layer_output_index") print(output_data) # Validate the results by comparing with precomputed values. @@ -99,5 +96,5 @@ def test_resnet50(self): self.assertEqual(output_data[i][0][0], EXPECTED_CLASS_INDEX) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_trt_dla/test.sh b/qa/L0_trt_dla/test.sh old mode 100644 new mode 100755 diff --git a/qa/L0_trt_dynamic_shape/test.sh b/qa/L0_trt_dynamic_shape/test.sh index 99ecc7f2b8..43a39dd199 100755 --- a/qa/L0_trt_dynamic_shape/test.sh +++ b/qa/L0_trt_dynamic_shape/test.sh @@ -305,7 +305,7 @@ kill $SERVER_PID wait $SERVER_PID -# Adding test cases for mulitple optimization profiles with static shapes. +# Adding test cases for multiple optimization profiles with static shapes. # Will load only the following profiles with the static shapes: # Profile 7: [1, 33] # Profile 8: [3, 33] diff --git a/qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py b/qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py old mode 100644 new mode 100755 index 8b01cbc206..d9f890d9b6 --- a/qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py +++ b/qa/L0_trt_dynamic_shape/trt_dynamic_shape_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,41 +27,52 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems -import os -import shutil -import time import unittest -import numpy as np + import infer_util as iu +import numpy as np import test_util as tu import tritonhttpclient from tritonclientutils import InferenceServerException class TrtDynamicShapeTest(tu.TestResultCollector): - def setUp(self): self.dtype_ = np.float32 - self.model_name_ = 'plan' + self.model_name_ = "plan" def test_load_specific_optimization_profile(self): # Only OP 5 should be available, which only allow batch size 8 tensor_shape = (1,) try: - iu.infer_exact(self, self.model_name_, (1,) + tensor_shape, 1, - self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (1,) + tensor_shape, + 1, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue( "model expected the shape of dimension 0 to be between 6 and 8 but received 1" - in ex.message()) + in ex.message() + ) try: - iu.infer_exact(self, self.model_name_, (8,) + tensor_shape, 8, - self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (8,) + tensor_shape, + 8, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) @@ -68,37 +81,60 @@ def test_load_default_optimization_profile(self): tensor_shape = (33,) try: - iu.infer_exact(self, self.model_name_, (8,) + tensor_shape, 8, - self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (8,) + tensor_shape, + 8, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) over_tensor_shape = (34,) try: - iu.infer_exact(self, self.model_name_, (8,) + over_tensor_shape, 8, - self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (8,) + over_tensor_shape, + 8, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue( "model expected the shape of dimension 1 to be between 1 and 33 but received 34" - in ex.message()) + in ex.message() + ) def test_select_optimization_profile(self): # Different profile has different optimized input shape batch_size = 4 tensor_shape = (16,) try: - iu.infer_exact(self, self.model_name_, (batch_size,) + tensor_shape, - batch_size, self.dtype_, self.dtype_, self.dtype_) + iu.infer_exact( + self, + self.model_name_, + (batch_size,) + tensor_shape, + batch_size, + self.dtype_, + self.dtype_, + self.dtype_, + ) except InferenceServerException as ex: self.assertTrue(False, "unexpected error {}".format(ex)) def test_load_wrong_optimization_profile(self): client = tritonhttpclient.InferenceServerClient("localhost:8000") - model_name = tu.get_model_name(self.model_name_, self.dtype_, - self.dtype_, self.dtype_) + model_name = tu.get_model_name( + self.model_name_, self.dtype_, self.dtype_, self.dtype_ + ) model_status = client.is_model_ready(model_name, "1") self.assertFalse(model_status, "expected model to be not ready") -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_trt_error_propagation/test.sh b/qa/L0_trt_error_propagation/test.sh new file mode 100755 index 0000000000..dac3f6349e --- /dev/null +++ b/qa/L0_trt_error_propagation/test.sh @@ -0,0 +1,82 @@ +#!/bin/bash +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +export CUDA_VISIBLE_DEVICES=0 +SERVER=/opt/tritonserver/bin/tritonserver +source ../common/util.sh + +# Create TensorRT model with invalid plan file +rm -rf models && mkdir models +mkdir models/invalid_plan_file && (cd models/invalid_plan_file && \ + echo -e "name: \"invalid_plan_file\"" >> config.pbtxt && \ + echo -e "platform: \"tensorrt_plan\"" >> config.pbtxt && \ + echo -e "input [\n {\n name: \"INPUT\"\n data_type: TYPE_FP32\n dims: [-1]\n }\n ]" >> config.pbtxt && \ + echo -e "output [\n {\n name: \"OUTPUT\"\n data_type: TYPE_FP32\n dims: [-1]\n }\n ]" >> config.pbtxt && \ + mkdir 1 && echo "----- invalid model.plan -----" >> 1/model.plan) + +# Test with and without auto complete enabled +for ENABLE_AUTOCOMPLETE in "YES" "NO"; do + + if [[ "$ENABLE_AUTOCOMPLETE" == "YES" ]]; then + TEST_NAME="test_invalid_trt_model_autocomplete" + SERVER_ARGS="--model-repository=models --model-control-mode=explicit" + else + TEST_NAME="test_invalid_trt_model" + SERVER_ARGS="--model-repository=models --model-control-mode=explicit --disable-auto-complete-config" + fi + + SERVER_LOG="./$TEST_NAME.server.log" + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + RET=0 + + set +e + python trt_error_propagation_test.py TestTrtErrorPropagation.$TEST_NAME > $TEST_NAME.log 2>&1 + if [ $? -ne 0 ]; then + cat $TEST_NAME.log + echo -e "\n***\n*** Test FAILED\n***" + RET=1 + fi + set -e + + kill $SERVER_PID + wait $SERVER_PID + + if [ $RET -ne 0 ]; then + exit $RET + fi + +done + +# Exit with success +echo -e "\n***\n*** Test Passed\n***" +exit 0 diff --git a/qa/L0_trt_error_propagation/trt_error_propagation_test.py b/qa/L0_trt_error_propagation/trt_error_propagation_test.py new file mode 100755 index 0000000000..83527a7533 --- /dev/null +++ b/qa/L0_trt_error_propagation/trt_error_propagation_test.py @@ -0,0 +1,72 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import unittest + +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException + + +class TestTrtErrorPropagation(unittest.TestCase): + def setUp(self): + # Initialize client + self.__triton = grpcclient.InferenceServerClient("localhost:8001", verbose=True) + + def test_invalid_trt_model(self): + with self.assertRaises(InferenceServerException) as cm: + self.__triton.load_model("invalid_plan_file") + err_msg = str(cm.exception) + # All 'expected_msg_parts' should be present in the 'err_msg' in order + expected_msg_parts = [ + "load failed for model", + "version 1 is at UNAVAILABLE state: ", + "Internal: unable to create TensorRT engine: ", + "Error Code ", + "Internal Error ", + ] + for expected_msg_part in expected_msg_parts: + self.assertIn( + expected_msg_part, + err_msg, + "Cannot find an expected part of error message", + ) + _, err_msg = err_msg.split(expected_msg_part) + + def test_invalid_trt_model_autocomplete(self): + with self.assertRaises(InferenceServerException) as cm: + self.__triton.load_model("invalid_plan_file") + err_msg = str(cm.exception) + self.assertIn( + "Internal: unable to load plan file to auto complete config", + err_msg, + "Caught an unexpected exception", + ) + + +if __name__ == "__main__": + unittest.main() diff --git a/qa/L0_trt_plugin/test.sh b/qa/L0_trt_plugin/test.sh old mode 100644 new mode 100755 index 13df59c56f..7ffc7e215d --- a/qa/L0_trt_plugin/test.sh +++ b/qa/L0_trt_plugin/test.sh @@ -43,18 +43,112 @@ export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG="./client.log" PLUGIN_TEST=trt_plugin_test.py -EXPECTED_NUM_TESTS="2" -DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_trt_plugin_model_repository +# On windows the paths invoked by the script (running in WSL) must use +# /mnt/c when needed but the paths on the tritonserver command-line +# must be C:/ style. 
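The comment above summarizes the path convention the plugin test relies on under WSL: files are accessed through `/mnt/c/...` paths while the tritonserver command line takes `C:/`-style paths. Purely as an illustration of that mapping (this helper is hypothetical and not used by the script):

def wsl_to_windows(path):
    """Map a WSL mount path such as /mnt/c/models to Windows-style C:/models."""
    prefix = "/mnt/"
    if path.startswith(prefix) and len(path) > len(prefix):
        drive = path[len(prefix)]
        rest = path[len(prefix) + 1 :]
        return drive.upper() + ":" + (rest if rest else "/")
    return path


print(wsl_to_windows("/mnt/c/data/inferenceserver"))  # C:/data/inferenceserver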
+if [[ "$(< /proc/sys/kernel/osrelease)" == *microsoft* ]]; then + DATADIR=${DATADIR:="/mnt/c/data/inferenceserver/${REPO_VERSION}"} + MODELDIR=${MODELDIR:=C:/models} + CUSTOMPLUGIN=${CUSTOMPLUGIN:=$MODELDIR/clipplugin.dll} + BACKEND_DIR=${BACKEND_DIR:=C:/tritonserver/backends} + SERVER=${SERVER:=/mnt/c/tritonserver/bin/tritonserver.exe} +else + DATADIR=${DATADIR:="/data/inferenceserver/${REPO_VERSION}"} + MODELDIR=${MODELDIR:=`pwd`/models} + CUSTOMPLUGIN=${CUSTOMPLUGIN:=$MODELDIR/libclipplugin.so} + TRITON_DIR=${TRITON_DIR:="/opt/tritonserver"} + BACKEND_DIR=${TRITON_DIR}/backends + SERVER=${TRITON_DIR}/bin/tritonserver +fi -SERVER=/opt/tritonserver/bin/tritonserver -SERVER_ARGS="--model-repository=$DATADIR --exit-timeout-secs=120" -SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -f $SERVER_LOG $CLIENT_LOG - RET=0 +rm -f ./*.log + +SERVER_ARGS_BASE="--model-repository=${MODELDIR} --backend-directory=${BACKEND_DIR} --log-verbose=1" +SERVER_TIMEOUT=20 + +LOG_IDX=0 + +## Default Plugin Tests + +## Create model folder with default plugin models +rm -fr models && mkdir -p models +set -e +find $DATADIR/qa_trt_plugin_model_repository/ -mindepth 1 -maxdepth 1 ! -iname '*clipplugin*' -exec cp -rv {} models \; + +SERVER_ARGS=$SERVER_ARGS_BASE +SERVER_LOG="./inference_server_$LOG_IDX.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +rm -f $CLIENT_LOG +set +e +python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_gelu >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +rm -f $CLIENT_LOG +python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_norm >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill_server + +## Custom Plugin Tests + +## Create model folder with custom plugin models for remaining tests +rm -fr models && mkdir -p models +find $DATADIR/qa_trt_plugin_model_repository/ -maxdepth 1 -iname '*clipplugin*' -exec cp -r {} models \; + +LOG_IDX=$((LOG_IDX+1)) + +## Baseline Failure Test +## Plugin library not loaded +SERVER_ARGS=$SERVER_ARGS_BASE +SERVER_LOG="./inference_server_$LOG_IDX.log" + +run_server +if [ "$SERVER_PID" != "0" ]; then + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed\n" + echo -e "Unexpected successful server start $SERVER\n***" + kill_server + exit 1 +fi + +LOG_IDX=$((LOG_IDX+1)) + +## Backend Config, Single Plugin Test +SERVER_ARGS="${SERVER_ARGS_BASE} --backend-config=tensorrt,plugins=${CUSTOMPLUGIN}" +SERVER_LOG="./inference_server_$LOG_IDX.log" run_server if [ "$SERVER_PID" == "0" ]; then @@ -63,14 +157,15 @@ if [ "$SERVER_PID" == "0" ]; then exit 1 fi +rm -f $CLIENT_LOG set +e -python $PLUGIN_TEST >$CLIENT_LOG 2>&1 +python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_clip >>$CLIENT_LOG 2>&1 if [ $? -ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Failed\n***" RET=1 else - check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + check_test_results $TEST_RESULT_FILE 1 if [ $? 
-ne 0 ]; then cat $CLIENT_LOG echo -e "\n***\n*** Test Result Verification Failed\n***" @@ -79,13 +174,80 @@ else fi set -e -kill $SERVER_PID -wait $SERVER_PID +kill_server + +LOG_IDX=$((LOG_IDX+1)) + +## Backend Config, Multiple Plugins Test +SERVER_ARGS="${SERVER_ARGS_BASE} --backend-config=tensorrt,plugins=${CUSTOMPLUGIN}" +SERVER_LOG="./inference_server_$LOG_IDX.log" + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +rm -f $CLIENT_LOG +set +e +python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_clip >>$CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 +else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi +set -e + +kill_server + +LOG_IDX=$((LOG_IDX+1)) + +## LD_PRELOAD, Single Plugin Test +## LD_PRELOAD is only on Linux + +SERVER_LD_PRELOAD=$CUSTOMPLUGIN +SERVER_ARGS=$SERVER_ARGS_BASE +SERVER_LOG="./inference_server_$LOG_IDX.log" + +if [[ "$(< /proc/sys/kernel/osrelease)" != *microsoft* ]]; then + run_server + if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 + fi + + rm -f $CLIENT_LOG + set +e + python3 $PLUGIN_TEST PluginModelTest.test_raw_fff_clip >>$CLIENT_LOG 2>&1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Failed\n***" + RET=1 + else + check_test_results $TEST_RESULT_FILE 1 + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi + fi + set -e + + kill_server +fi if [ $RET -eq 0 ]; then echo -e "\n***\n*** Test Passed\n***" else - cat $CLIENT_LOG echo -e "\n***\n*** Test FAILED\n***" fi diff --git a/qa/L0_trt_plugin/trt_plugin_test.py b/qa/L0_trt_plugin/trt_plugin_test.py old mode 100644 new mode 100755 index fde88244a9..5dcc6318f5 --- a/qa/L0_trt_plugin/trt_plugin_test.py +++ b/qa/L0_trt_plugin/trt_plugin_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2018-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,65 +27,98 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
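The plugin test below checks `CustomGeluPluginDynamic` against the common tanh-based GELU approximation. As a standalone sanity sketch (not part of the test), that approximation can be compared with the exact erf-based GELU; the sample grid and tolerance here are illustrative:

import math

import numpy as np

x = np.linspace(-3.0, 3.0, 7)

# Tanh-based approximation used by the plugin test:
#   gelu(x) ~ 0.5 * x * (1 + tanh(0.797885 * x + 0.035677 * x**3))
approx = 0.5 * x * (1.0 + np.tanh(0.797885 * x + 0.035677 * x**3))

# Exact GELU is x * Phi(x), with the standard normal CDF written via erf.
exact = 0.5 * x * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in x]))

# The approximation tracks the exact value to within roughly 1e-3 on this range.
np.testing.assert_allclose(approx, exact, atol=1e-3)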
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems +import os import unittest + import numpy as np -import os import test_util as tu +import tritonclient.http as httpclient -import tritonhttpclient as httpclient -from tritonclientutils import InferenceServerException +# By default, find tritonserver on "localhost", but can be overridden +# with TRITONSERVER_IPADDR envvar +_tritonserver_ipaddr = os.environ.get("TRITONSERVER_IPADDR", "localhost") class PluginModelTest(tu.TestResultCollector): - def _full_exact(self, model_name, plugin_name, shape): - triton_client = httpclient.InferenceServerClient("localhost:8000", - verbose=True) + print(f"{_tritonserver_ipaddr}:8000") + triton_client = httpclient.InferenceServerClient(f"{_tritonserver_ipaddr}:8000") inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', list(shape), "FP32")) + inputs.append(httpclient.InferInput("INPUT0", list(shape), "FP32")) input0_data = np.ones(shape=shape).astype(np.float32) inputs[0].set_data_from_numpy(input0_data, binary_data=True) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - - results = triton_client.infer(model_name + '_' + plugin_name, - inputs, - outputs=outputs) - - output0_data = results.as_numpy('OUTPUT0') - - # Verify values of Normalize and GELU - if plugin_name == 'CustomGeluPluginDynamic': + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) + + results = triton_client.infer( + model_name + "_" + plugin_name, inputs, outputs=outputs + ) + + output0_data = results.as_numpy("OUTPUT0") + tolerance_relative = 1e-6 + tolerance_absolute = 1e-7 + + # Verify values of Clip, GELU, and Normalize + if plugin_name == "CustomClipPlugin": + # Clip data to minimum of .1, maximum of .5 + test_output = np.clip(input0_data, 0.1, 0.5) + np.testing.assert_allclose( + output0_data, + test_output, + rtol=tolerance_relative, + atol=tolerance_absolute, + ) + elif plugin_name == "CustomGeluPluginDynamic": # Add bias input0_data += 1 # Calculate Gelu activation - test_output = (input0_data * - 0.5) * (1 + np.tanh((0.797885 * input0_data) + - (0.035677 * (input0_data**3)))) - self.assertTrue(np.isclose(output0_data, test_output).all()) - else: + test_output = (input0_data * 0.5) * ( + 1 + np.tanh((0.797885 * input0_data) + (0.035677 * (input0_data**3))) + ) + np.testing.assert_allclose( + output0_data, + test_output, + rtol=tolerance_relative, + atol=tolerance_absolute, + ) + elif plugin_name == "Normalize_TRT": # L2 norm is sqrt(sum([1]*16))) test_output = input0_data / np.sqrt(sum([1] * 16)) - self.assertTrue(np.isclose(output0_data, test_output).all()) + np.testing.assert_allclose( + output0_data, + test_output, + rtol=tolerance_relative, + atol=tolerance_absolute, + ) + else: + self.fail("Unexpected plugin: " + plugin_name) + + def test_raw_fff_clip(self): + for bs in (1, 8): + self._full_exact( + "plan_float32_float32_float32", "CustomClipPlugin", (bs, 16) + ) def test_raw_fff_gelu(self): - self._full_exact('plan_nobatch_float32_float32_float32', - 'CustomGeluPluginDynamic', (16, 1, 1)) + self._full_exact( + "plan_nobatch_float32_float32_float32", + "CustomGeluPluginDynamic", + (16, 1, 1), + ) def test_raw_fff_norm(self): # model that supports batching for bs in (1, 8): - self._full_exact('plan_float32_float32_float32', 'Normalize_TRT', - (bs, 16, 16, 16)) + self._full_exact( + "plan_float32_float32_float32", "Normalize_TRT", (bs, 16, 16, 16) + ) -if __name__ == '__main__': +if __name__ == 
"__main__": unittest.main() diff --git a/qa/L0_trt_reformat_free/test.sh b/qa/L0_trt_reformat_free/test.sh index c834a05992..ebdc83a5b8 100755 --- a/qa/L0_trt_reformat_free/test.sh +++ b/qa/L0_trt_reformat_free/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,7 +42,6 @@ TEST_RESULT_FILE='test_results.txt' export CUDA_VISIBLE_DEVICES=0 CLIENT_LOG="./client.log" -PERF_CLIENT=../clients/perf_client TRT_TEST=trt_reformat_free_test.py DATADIR="./models" diff --git a/qa/L0_trt_reformat_free/trt_reformat_free_test.py b/qa/L0_trt_reformat_free/trt_reformat_free_test.py old mode 100644 new mode 100755 index fedcf62184..ea36f9c24a --- a/qa/L0_trt_reformat_free/trt_reformat_free_test.py +++ b/qa/L0_trt_reformat_free/trt_reformat_free_test.py @@ -1,4 +1,6 @@ -# Copyright 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,20 +27,16 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems -import os -import shutil -import time import unittest +from builtins import range + import numpy as np -import infer_util as iu import test_util as tu -import tritonhttpclient import tritonclient.utils.shared_memory as shm -from tritonclientutils import InferenceServerException +import tritonhttpclient def div_up(a, b): @@ -48,36 +46,40 @@ def div_up(a, b): def reformat(format, tensor_np): if format == "CHW2": factor = 2 - if format == "CHW32": + elif format == "CHW32": factor = 32 + else: + raise ValueError( + "Unexpected format {} for testing reformat-free input".format(format) + ) shape = list(tensor_np.shape) + [factor] shape[-4] = div_up(shape[-4], factor) reformatted_tensor_np = np.empty(shape, tensor_np.dtype) if len(tensor_np.shape) == 3: batch = [(tensor_np, reformatted_tensor_np)] elif len(tensor_np.shape) == 4: - batch = [(tensor_np[idx], reformatted_tensor_np[idx]) - for idx in range(tensor_np.shape[0])] + batch = [ + (tensor_np[idx], reformatted_tensor_np[idx]) + for idx in range(tensor_np.shape[0]) + ] else: raise ValueError( "Unexpected numpy shape {} for testing reformat-free input".format( - tensor_np.shape)) - for (tensor, reformatted_tensor) in batch: + tensor_np.shape + ) + ) + for tensor, reformatted_tensor in batch: for c in range(tensor.shape[0]): for h in range(tensor.shape[1]): for w in range(tensor.shape[2]): - reformatted_tensor[c // - factor][h][w][c % - factor] = tensor[c][h][w] + reformatted_tensor[c // factor][h][w][c % factor] = tensor[c][h][w] return reformatted_tensor_np class TrtReformatFreeTest(tu.TestResultCollector): - def add_reformat_free_data_as_shared_memory(self, name, tensor, tensor_np): byte_size = tensor_np.size * tensor_np.dtype.itemsize - self.shm_handles.append( - shm.create_shared_memory_region(name, name, byte_size)) + self.shm_handles.append(shm.create_shared_memory_region(name, name, byte_size)) # Put data values into shared memory shm.set_shared_memory_region(self.shm_handles[-1], [tensor_np]) # Register 
shared memory with Triton Server @@ -88,7 +90,8 @@ def add_reformat_free_data_as_shared_memory(self, name, tensor, tensor_np): def setUp(self): self.shm_handles = [] self.triton_client = tritonhttpclient.InferenceServerClient( - "localhost:8000", verbose=True) + "localhost:8000", verbose=True + ) def tearDown(self): self.triton_client.unregister_system_shared_memory() @@ -106,39 +109,42 @@ def test_nobatch_chw2_input(self): # for non-linear format tensor, the data buffer is padded and thus the # data byte size may not match what is calculated from tensor shape inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT0', [13, 2, 1], "FP16")) - self.add_reformat_free_data_as_shared_memory("input0", inputs[-1], - reformatted_input_np) - inputs.append(tritonhttpclient.InferInput('INPUT1', [13, 2, 1], "FP16")) - self.add_reformat_free_data_as_shared_memory("input1", inputs[-1], - reformatted_input_np) + inputs.append(tritonhttpclient.InferInput("INPUT0", [13, 2, 1], "FP16")) + self.add_reformat_free_data_as_shared_memory( + "input0", inputs[-1], reformatted_input_np + ) + inputs.append(tritonhttpclient.InferInput("INPUT1", [13, 2, 1], "FP16")) + self.add_reformat_free_data_as_shared_memory( + "input1", inputs[-1], reformatted_input_np + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) - results = self.triton_client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = self.triton_client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) # Validate the results by comparing with precomputed values. 
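For readers unfamiliar with TensorRT reformat-free I/O: the reformat() helper earlier in this file packs channels into groups of "factor" (2 for CHW2, 32 for CHW32), padding the last group. A minimal standalone sketch of that packing, not part of the patch and using illustrative names only:

import numpy as np


def pack_chw(tensor, factor):
    # Pack a (C, H, W) tensor into the (ceil(C / factor), H, W, factor) layout
    # that the non-linear TensorRT formats expect; padding is zero-filled here.
    c, h, w = tensor.shape
    packed = np.zeros(((c + factor - 1) // factor, h, w, factor), tensor.dtype)
    for ch in range(c):
        packed[ch // factor, :, :, ch % factor] = tensor[ch]
    return packed


# 13 channels packed with factor 2 -> 7 groups (the last one is half padding),
# matching the [13, 2, 1] FP16 inputs used by the CHW2 tests.
x = np.arange(13 * 2 * 1, dtype=np.float16).reshape((13, 2, 1))
y = pack_chw(x, factor=2)
assert y.shape == (7, 2, 1, 2)
for ch in range(13):
    assert np.array_equal(y[ch // 2, :, :, ch % 2], x[ch])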
- output0_np = results.as_numpy('OUTPUT0') - output1_np = results.as_numpy('OUTPUT1') + output0_np = results.as_numpy("OUTPUT0") + output1_np = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output0_np, expected_output0_np), - "OUTPUT0 expected: {}, got {}".format(expected_output0_np, - output0_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output0_np, output0_np), + ) self.assertTrue( np.array_equal(output1_np, expected_output1_np), - "OUTPUT0 expected: {}, got {}".format(expected_output1_np, - output1_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output1_np, output1_np), + ) def test_chw2_input(self): model_name = "plan_CHW2_LINEAR_float16_float16_float16" for bs in [1, 8]: - input_np = np.arange(26 * bs, dtype=np.float16).reshape( - (bs, 13, 2, 1)) + input_np = np.arange(26 * bs, dtype=np.float16).reshape((bs, 13, 2, 1)) expected_output0_np = input_np + input_np expected_output1_np = input_np - input_np reformatted_input_np = reformat("CHW2", input_np) @@ -148,37 +154,37 @@ def test_chw2_input(self): # and thus the data byte size may not match what is calculated from # tensor shape inputs = [] - inputs.append( - tritonhttpclient.InferInput('INPUT0', [bs, 13, 2, 1], "FP16")) + inputs.append(tritonhttpclient.InferInput("INPUT0", [bs, 13, 2, 1], "FP16")) self.add_reformat_free_data_as_shared_memory( - "input0" + str(bs), inputs[-1], reformatted_input_np) - inputs.append( - tritonhttpclient.InferInput('INPUT1', [bs, 13, 2, 1], "FP16")) + "input0" + str(bs), inputs[-1], reformatted_input_np + ) + inputs.append(tritonhttpclient.InferInput("INPUT1", [bs, 13, 2, 1], "FP16")) self.add_reformat_free_data_as_shared_memory( - "input1" + str(bs), inputs[-1], reformatted_input_np) + "input1" + str(bs), inputs[-1], reformatted_input_np + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', - binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', - binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) - results = self.triton_client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = self.triton_client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) # Validate the results by comparing with precomputed values. 
- output0_np = results.as_numpy('OUTPUT0') - output1_np = results.as_numpy('OUTPUT1') + output0_np = results.as_numpy("OUTPUT0") + output1_np = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output0_np, expected_output0_np), - "OUTPUT0 expected: {}, got {}".format(expected_output0_np, - output0_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output0_np, output0_np), + ) self.assertTrue( np.array_equal(output1_np, expected_output1_np), - "OUTPUT0 expected: {}, got {}".format(expected_output1_np, - output1_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output1_np, output1_np), + ) def test_nobatch_chw32_input(self): model_name = "plan_nobatch_CHW32_LINEAR_float32_float32_float32" @@ -191,39 +197,42 @@ def test_nobatch_chw32_input(self): # for non-linear format tensor, the data buffer is padded and thus the # data byte size may not match what is calculated from tensor shape inputs = [] - inputs.append(tritonhttpclient.InferInput('INPUT0', [13, 2, 1], "FP32")) - self.add_reformat_free_data_as_shared_memory("input0", inputs[-1], - reformatted_input_np) - inputs.append(tritonhttpclient.InferInput('INPUT1', [13, 2, 1], "FP32")) - self.add_reformat_free_data_as_shared_memory("input1", inputs[-1], - reformatted_input_np) + inputs.append(tritonhttpclient.InferInput("INPUT0", [13, 2, 1], "FP32")) + self.add_reformat_free_data_as_shared_memory( + "input0", inputs[-1], reformatted_input_np + ) + inputs.append(tritonhttpclient.InferInput("INPUT1", [13, 2, 1], "FP32")) + self.add_reformat_free_data_as_shared_memory( + "input1", inputs[-1], reformatted_input_np + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) - results = self.triton_client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = self.triton_client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) # Validate the results by comparing with precomputed values. 
- output0_np = results.as_numpy('OUTPUT0') - output1_np = results.as_numpy('OUTPUT1') + output0_np = results.as_numpy("OUTPUT0") + output1_np = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output0_np, expected_output0_np), - "OUTPUT0 expected: {}, got {}".format(expected_output0_np, - output0_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output0_np, output0_np), + ) self.assertTrue( np.array_equal(output1_np, expected_output1_np), - "OUTPUT0 expected: {}, got {}".format(expected_output1_np, - output1_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output1_np, output1_np), + ) def test_chw32_input(self): model_name = "plan_CHW32_LINEAR_float32_float32_float32" for bs in [1, 8]: - input_np = np.arange(26 * bs, dtype=np.float32).reshape( - (bs, 13, 2, 1)) + input_np = np.arange(26 * bs, dtype=np.float32).reshape((bs, 13, 2, 1)) expected_output0_np = input_np + input_np expected_output1_np = input_np - input_np reformatted_input_np = reformat("CHW32", input_np) @@ -233,38 +242,38 @@ def test_chw32_input(self): # and thus the data byte size may not match what is calculated from # tensor shape inputs = [] - inputs.append( - tritonhttpclient.InferInput('INPUT0', [bs, 13, 2, 1], "FP32")) + inputs.append(tritonhttpclient.InferInput("INPUT0", [bs, 13, 2, 1], "FP32")) self.add_reformat_free_data_as_shared_memory( - "input0" + str(bs), inputs[-1], reformatted_input_np) - inputs.append( - tritonhttpclient.InferInput('INPUT1', [bs, 13, 2, 1], "FP32")) + "input0" + str(bs), inputs[-1], reformatted_input_np + ) + inputs.append(tritonhttpclient.InferInput("INPUT1", [bs, 13, 2, 1], "FP32")) self.add_reformat_free_data_as_shared_memory( - "input1" + str(bs), inputs[-1], reformatted_input_np) + "input1" + str(bs), inputs[-1], reformatted_input_np + ) outputs = [] outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT0', - binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT0", binary_data=True) + ) outputs.append( - tritonhttpclient.InferRequestedOutput('OUTPUT1', - binary_data=True)) + tritonhttpclient.InferRequestedOutput("OUTPUT1", binary_data=True) + ) - results = self.triton_client.infer(model_name=model_name, - inputs=inputs, - outputs=outputs) + results = self.triton_client.infer( + model_name=model_name, inputs=inputs, outputs=outputs + ) # Validate the results by comparing with precomputed values. 
- output0_np = results.as_numpy('OUTPUT0') - output1_np = results.as_numpy('OUTPUT1') + output0_np = results.as_numpy("OUTPUT0") + output1_np = results.as_numpy("OUTPUT1") self.assertTrue( np.array_equal(output0_np, expected_output0_np), - "OUTPUT0 expected: {}, got {}".format(expected_output0_np, - output0_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output0_np, output0_np), + ) self.assertTrue( np.array_equal(output1_np, expected_output1_np), - "OUTPUT0 expected: {}, got {}".format(expected_output1_np, - output1_np)) + "OUTPUT0 expected: {}, got {}".format(expected_output1_np, output1_np), + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_trt_shape_tensors/test.sh b/qa/L0_trt_shape_tensors/test.sh old mode 100644 new mode 100755 index 9ca0bc958f..eed67d9dcb --- a/qa/L0_trt_shape_tensors/test.sh +++ b/qa/L0_trt_shape_tensors/test.sh @@ -49,7 +49,7 @@ SERVER_ARGS="--model-repository=`pwd`/models" SERVER_LOG="./inference_server.log" source ../common/util.sh -rm -fr *.serverlog *.log *.serverlog +rm -fr *.log rm -fr models && mkdir models cp -r /data/inferenceserver/${REPO_VERSION}/qa_shapetensor_model_repository/* models/. @@ -134,7 +134,7 @@ sed -i "s/^version_policy:.*/version_policy: { specific { versions: [1] }}/" $CO for i in \ test_dynamic_different_shape_values \ test_dynamic_identical_shape_values; do - SERVER_LOG="./$i.serverlog" + SERVER_LOG="./$i.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -169,7 +169,7 @@ for i in \ test_sequence_identical_shape_values ; do export TRITONSERVER_BACKLOG_DELAY_SCHEDULER=0 export TRITONSERVER_DELAY_SCHEDULER=12 - SERVER_LOG="./$i.serverlog" + SERVER_LOG="./$i.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" @@ -215,7 +215,7 @@ for i in \ test_dynaseq_different_shape_values_parallel \ ;do SERVER_ARGS="--model-repository=`pwd`/models" - SERVER_LOG="./$i.serverlog" + SERVER_LOG="./$i.server.log" run_server if [ "$SERVER_PID" == "0" ]; then echo -e "\n***\n*** Failed to start $SERVER\n***" diff --git a/qa/L0_trt_shape_tensors/trt_shape_tensor_test.py b/qa/L0_trt_shape_tensors/trt_shape_tensor_test.py old mode 100644 new mode 100755 index 89a3f889dc..a83795f981 --- a/qa/L0_trt_shape_tensors/trt_shape_tensor_test.py +++ b/qa/L0_trt_shape_tensors/trt_shape_tensor_test.py @@ -1,4 +1,6 @@ -# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,24 +27,22 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
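Both the reformat-free tests above and the shape-tensor tests below can move input data through system shared memory rather than sending it inline. A hedged, self-contained sketch of that flow with the tritonclient utilities (region, model, and tensor names are placeholders, not taken from the patch):

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient("localhost:8000")

data = np.arange(16, dtype=np.float32)
byte_size = data.size * data.dtype.itemsize

# Create a system shared-memory region, copy the data in, and register it
# with the server under the same name.
handle = shm.create_shared_memory_region("input0_data", "/input0_data", byte_size)
shm.set_shared_memory_region(handle, [data])
client.register_system_shared_memory("input0_data", "/input0_data", byte_size)

# Point the input at the region instead of attaching the bytes to the request.
inp = httpclient.InferInput("INPUT0", [16], "FP32")
inp.set_shared_memory("input0_data", byte_size)

# ... client.infer(...) with a real model would go here ...

# Clean up, mirroring the tearDown() logic in the tests.
client.unregister_system_shared_memory()
shm.destroy_shared_memory_region(handle)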
import sys + sys.path.append("../common") -from builtins import range -from future.utils import iteritems import os -import unittest -import time import threading -import traceback -import numpy as np +import time +import unittest +from builtins import range + import infer_util as iu -import test_util as tu +import numpy as np import sequence_util as su - +import test_util as tu import tritongrpcclient as grpcclient -TEST_SYSTEM_SHARED_MEMORY = bool( - int(os.environ.get('TEST_SYSTEM_SHARED_MEMORY', 0))) +TEST_SYSTEM_SHARED_MEMORY = bool(int(os.environ.get("TEST_SYSTEM_SHARED_MEMORY", 0))) _model_instances = 1 _max_queue_delay_ms = 10000 @@ -53,7 +53,6 @@ class InferShapeTensorTest(tu.TestResultCollector): - def setUp(self): # The helper client for setup will be GRPC for simplicity. self.triton_client_ = grpcclient.InferenceServerClient("localhost:8001") @@ -76,14 +75,16 @@ def check_deferred_exception(self): if len(_deferred_exceptions) > 0: raise _deferred_exceptions[0] - def check_response(self, - bs, - thresholds, - shape_values, - dummy_input_shapes, - shm_region_names=None, - precreated_shm_regions=None, - shm_suffix=""): + def check_response( + self, + bs, + thresholds, + shape_values, + dummy_input_shapes, + shm_region_names=None, + precreated_shm_regions=None, + shm_suffix="", + ): try: # Add batch size to shape as full shape is expected for i in range(len(dummy_input_shapes)): @@ -94,7 +95,7 @@ def check_response(self, iu.infer_shape_tensor( self, - 'plan', + "plan", np.float32, shape_values, dummy_input_shapes, @@ -102,7 +103,8 @@ def check_response(self, use_streaming=False, shm_suffix=shm_suffix, use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=bs) + batch_size=bs, + ) end_ms = int(round(time.time() * 1000)) @@ -111,13 +113,21 @@ def check_response(self, if lt_ms is not None: self.assertTrue( (end_ms - start_ms) < lt_ms, - "expected less than " + str(lt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected less than " + + str(lt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) if gt_ms is not None: self.assertTrue( (end_ms - start_ms) > gt_ms, - "expected greater than " + str(gt_ms) + - "ms response time, got " + str(end_ms - start_ms) + " ms") + "expected greater than " + + str(gt_ms) + + "ms response time, got " + + str(end_ms - start_ms) + + " ms", + ) except Exception as ex: self.add_deferred_exception(ex) @@ -127,109 +137,164 @@ def check_setup(self, model_name): bconfig = config.dynamic_batching self.assertTrue(2 in bconfig.preferred_batch_size) self.assertTrue(6 in bconfig.preferred_batch_size) - self.assertEqual(bconfig.max_queue_delay_microseconds, - _max_queue_delay_ms * 1000) # 10 secs + self.assertEqual( + bconfig.max_queue_delay_microseconds, _max_queue_delay_ms * 1000 + ) # 10 secs def check_status(self, model_name, batch_exec, exec_cnt, infer_cnt): - stats = self.triton_client_.get_inference_statistics(model_name, "1") - self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") - self.assertEqual(stats.model_stats[0].name, model_name, - "expect model stats for model {}".format(model_name)) + # There is a time window between when responses are returned and statistics are updated. + # To prevent intermittent test failure during that window, wait up to 10 seconds for the + # inference statistics to be ready. 
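The general shape of that wait-for-statistics pattern, sketched as a standalone helper before the retry loop that follows (client construction and model name are illustrative, not part of the patch):

import time

import tritonclient.grpc as grpcclient


def wait_for_exec_count(client, model_name, expected_exec_cnt, timeout_s=10):
    # Poll the per-model statistics until execution_count reaches the expected
    # value or the timeout expires, then hand the stats back to the caller so
    # its assertions can report any remaining mismatch.
    deadline = time.time() + timeout_s
    while True:
        stats = client.get_inference_statistics(model_name, "1")
        if (
            stats.model_stats
            and stats.model_stats[0].execution_count == expected_exec_cnt
        ):
            return stats
        if time.time() >= deadline:
            return stats
        time.sleep(1)


if __name__ == "__main__":
    # Assumes a server on localhost:8001 serving the named model.
    client = grpcclient.InferenceServerClient("localhost:8001")
    wait_for_exec_count(client, "plan_float32_float32_float32", 1)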
+ num_tries = 10 + for i in range(num_tries): + stats = self.triton_client_.get_inference_statistics(model_name, "1") + self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats") + actual_exec_cnt = stats.model_stats[0].execution_count + if actual_exec_cnt == exec_cnt: + break + print( + "WARNING: expect {} executions, got {} (attempt {})".format( + exec_cnt, actual_exec_cnt, i + ) + ) + time.sleep(1) + self.assertEqual( - stats.model_stats[0].version, "1", - "expect model stats for model {} version 1".format(model_name)) + stats.model_stats[0].name, + model_name, + "expect model stats for model {}".format(model_name), + ) + self.assertEqual( + stats.model_stats[0].version, + "1", + "expect model stats for model {} version 1".format(model_name), + ) if batch_exec is not None: batch_stats = stats.model_stats[0].batch_stats print(batch_stats) self.assertEqual( - len(batch_stats), len(batch_exec), + len(batch_stats), + len(batch_exec), "expected {} different batch-sizes, got {}".format( - len(batch_exec), len(batch_stats))) + len(batch_exec), len(batch_stats) + ), + ) for batch_stat in batch_stats: bs = batch_stat.batch_size bc = batch_stat.compute_infer.count self.assertTrue( - bs in batch_exec, - "did not find expected batch-size {}".format(bs)) + bs in batch_exec, "did not find expected batch-size {}".format(bs) + ) # Get count from one of the stats self.assertEqual( - bc, batch_exec[bs], - "expected model-execution-count {} for batch size {}, got {}" - .format(batch_exec[bs], bs, bc)) + bc, + batch_exec[bs], + "expected model-execution-count {} for batch size {}, got {}".format( + batch_exec[bs], bs, bc + ), + ) actual_exec_cnt = stats.model_stats[0].execution_count self.assertEqual( - actual_exec_cnt, exec_cnt, - "expected model-exec-count {}, got {}".format( - exec_cnt, actual_exec_cnt)) + actual_exec_cnt, + exec_cnt, + "expected model-exec-count {}, got {}".format(exec_cnt, actual_exec_cnt), + ) actual_infer_cnt = stats.model_stats[0].inference_count self.assertEqual( - actual_infer_cnt, infer_cnt, + actual_infer_cnt, + infer_cnt, "expected model-inference-count {}, got {}".format( - infer_cnt, actual_infer_cnt)) + infer_cnt, actual_infer_cnt + ), + ) actual_infer_cnt = stats.model_stats[0].inference_count self.assertEqual( - actual_infer_cnt, infer_cnt, + actual_infer_cnt, + infer_cnt, "expected model-inference-count {}, got {}".format( - infer_cnt, actual_infer_cnt)) + infer_cnt, actual_infer_cnt + ), + ) def test_static_batch(self): iu.infer_shape_tensor( self, - 'plan', - np.float32, [[32, 32]], [[8, 4, 4]], + "plan", + np.float32, + [[32, 32]], + [[8, 4, 4]], use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=8) + batch_size=8, + ) iu.infer_shape_tensor( self, - 'plan', - np.float32, [[4, 4]], [[8, 32, 32]], + "plan", + np.float32, + [[4, 4]], + [[8, 32, 32]], use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=8) + batch_size=8, + ) iu.infer_shape_tensor( self, - 'plan', - np.float32, [[4, 4]], [[8, 4, 4]], + "plan", + np.float32, + [[4, 4]], + [[8, 4, 4]], use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=8) + batch_size=8, + ) def test_nobatch(self): iu.infer_shape_tensor( self, - 'plan_nobatch', - np.float32, [[32, 32]], [[4, 4]], - use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY) + "plan_nobatch", + np.float32, + [[32, 32]], + [[4, 4]], + use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, + ) iu.infer_shape_tensor( self, - 'plan_nobatch', - np.float32, [[4, 4]], [[32, 32]], - 
use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY) + "plan_nobatch", + np.float32, + [[4, 4]], + [[32, 32]], + use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, + ) iu.infer_shape_tensor( self, - 'plan_nobatch', - np.float32, [[4, 4]], [[4, 4]], - use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY) + "plan_nobatch", + np.float32, + [[4, 4]], + [[4, 4]], + use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, + ) def test_wrong_shape_values(self): over_shape_values = [[32, 33]] try: iu.infer_shape_tensor( self, - 'plan', + "plan", np.float32, - over_shape_values, [[8, 4, 4]], + over_shape_values, + [[8, 4, 4]], use_system_shared_memory=TEST_SYSTEM_SHARED_MEMORY, - batch_size=8) + batch_size=8, + ) # InferenceServerException will be raised from different namespace, # use dynamic type characteristic to catch both ex except Exception as ex: self.assertTrue( "The shape value at index 2 is expected to be in range from 1 to 32, Got: 33" - in ex.message()) + in ex.message() + ) # Dynamic Batcher tests def test_dynamic_different_shape_values(self): @@ -245,22 +310,27 @@ def test_dynamic_different_shape_values(self): threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(3, (6000, None)), - kwargs={ - 'shape_values': [[2, 2]], - 'dummy_input_shapes': [[16, 16]], - 'shm_suffix': '{}'.format(len(threads)) - })) + threading.Thread( + target=self.check_response, + args=(3, (6000, None)), + kwargs={ + "shape_values": [[2, 2]], + "dummy_input_shapes": [[16, 16]], + "shm_suffix": "{}".format(len(threads)), + }, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(3, (_max_queue_delay_ms * 1.5, - _max_queue_delay_ms)), - kwargs={ - 'shape_values': [[4, 4]], - 'dummy_input_shapes': [[16, 16]], - 'shm_suffix': '{}'.format(len(threads)) - })) + threading.Thread( + target=self.check_response, + args=(3, (_max_queue_delay_ms * 1.5, _max_queue_delay_ms)), + kwargs={ + "shape_values": [[4, 4]], + "dummy_input_shapes": [[16, 16]], + "shm_suffix": "{}".format(len(threads)), + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -283,21 +353,27 @@ def test_dynamic_identical_shape_values(self): threads = [] threads.append( - threading.Thread(target=self.check_response, - args=(4, (6000, None)), - kwargs={ - 'shape_values': [[4, 4]], - 'dummy_input_shapes': [[16, 16]], - 'shm_suffix': '{}'.format(len(threads)) - })) + threading.Thread( + target=self.check_response, + args=(4, (6000, None)), + kwargs={ + "shape_values": [[4, 4]], + "dummy_input_shapes": [[16, 16]], + "shm_suffix": "{}".format(len(threads)), + }, + ) + ) threads.append( - threading.Thread(target=self.check_response, - args=(2, (6000, None)), - kwargs={ - 'shape_values': [[4, 4]], - 'dummy_input_shapes': [[16, 16]], - 'shm_suffix': '{}'.format(len(threads)) - })) + threading.Thread( + target=self.check_response, + args=(2, (6000, None)), + kwargs={ + "shape_values": [[4, 4]], + "dummy_input_shapes": [[16, 16]], + "shm_suffix": "{}".format(len(threads)), + }, + ) + ) threads[0].start() time.sleep(1) threads[1].start() @@ -310,7 +386,6 @@ def test_dynamic_identical_shape_values(self): class SequenceBatcherShapeTensorTest(su.SequenceBatcherTestUtil): - def get_expected_result(self, expected_result, value, flag_str=None): # Adjust the expected_result for models expected_result = value @@ -333,20 +408,21 @@ def test_sequence_identical_shape_values(self): # Need scheduler to wait for queue to contain all # inferences for both sequences. 
self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), - 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) + self.assertTrue("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) precreated_shm0_handles = self.precreate_register_shape_tensor_regions( - ((2, 1), (4, 2), (8, 3)), dtype, 0) + ((2, 1), (4, 2), (8, 3)), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_shape_tensor_regions( - ((2, 11), (4, 12), (8, 13)), dtype, 1) + ((2, 11), (4, 12), (8, 13)), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_shape_tensor_regions( - ((2, 111), (4, 112), (8, 113)), dtype, 2) + ((2, 111), (4, 112), (8, 113)), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_shape_tensor_regions( - ((2, 1111), (4, 1112), (8, 1113)), dtype, 3) + ((2, 1111), (4, 1112), (8, 1113)), dtype, 3 + ) threads = [] threads.append( threading.Thread( @@ -357,12 +433,17 @@ def test_sequence_identical_shape_values(self): 1001, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 1, None), (None, 4, 2, None), ("end", 8, - 3, None)), + ( + ("start", 2, 1, None), + (None, 4, 2, None), + ("end", 8, 3, None), + ), self.get_expected_result(6, 3, "end"), - precreated_shm0_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -372,12 +453,17 @@ def test_sequence_identical_shape_values(self): 1002, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 11, None), (None, 4, 12, None), - ("end", 8, 13, None)), + ( + ("start", 2, 11, None), + (None, 4, 12, None), + ("end", 8, 13, None), + ), self.get_expected_result(36, 13, "end"), - precreated_shm1_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -387,12 +473,17 @@ def test_sequence_identical_shape_values(self): 1003, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 111, None), (None, 4, 112, None), - ("end", 8, 113, None)), + ( + ("start", 2, 111, None), + (None, 4, 112, None), + ("end", 8, 113, None), + ), self.get_expected_result(336, 113, "end"), - precreated_shm2_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -402,12 +493,17 @@ def test_sequence_identical_shape_values(self): 1004, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 1111, None), (None, 4, 1112, None), - ("end", 8, 1113, None)), + ( + ("start", 2, 1111, None), + (None, 4, 1112, None), + ("end", 8, 1113, None), + ), self.get_expected_result(3336, 1113, "end"), - precreated_shm3_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm3_handles, + ), + kwargs={"sequence_name": 
"{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -435,13 +531,17 @@ def test_sequence_different_shape_values(self): dtype = np.float32 precreated_shm0_handles = self.precreate_register_shape_tensor_regions( - ((1, 1), (1, 2), (1, 3)), dtype, 0) + ((1, 1), (1, 2), (1, 3)), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_shape_tensor_regions( - ((32, 11), (32, 12), (32, 13)), dtype, 1) + ((32, 11), (32, 12), (32, 13)), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_shape_tensor_regions( - ((16, 111), (16, 112), (16, 113)), dtype, 2) + ((16, 111), (16, 112), (16, 113)), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_shape_tensor_regions( - ((1, 1111), (1, 1112), (1, 1113)), dtype, 3) + ((1, 1111), (1, 1112), (1, 1113)), dtype, 3 + ) try: model_name = tu.get_sequence_model_name("plan", dtype) self.check_setup(model_name) @@ -449,12 +549,9 @@ def test_sequence_different_shape_values(self): # Need scheduler to wait for queue to contain all # inferences for both sequences. self.assertTrue("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), - 12) - self.assertTrue( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) - self.assertEqual( - int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) + self.assertEqual(int(os.environ["TRITONSERVER_DELAY_SCHEDULER"]), 12) + self.assertTrue("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertEqual(int(os.environ["TRITONSERVER_BACKLOG_DELAY_SCHEDULER"]), 0) threads = [] threads.append( @@ -466,12 +563,17 @@ def test_sequence_different_shape_values(self): 1001, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 1, 1, None), (None, 1, 2, None), ("end", 1, - 3, None)), + ( + ("start", 1, 1, None), + (None, 1, 2, None), + ("end", 1, 3, None), + ), self.get_expected_result(6, 3, "end"), - precreated_shm0_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm0_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -481,12 +583,17 @@ def test_sequence_different_shape_values(self): 1002, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 32, 11, None), (None, 32, 12, None), - ("end", 32, 13, None)), + ( + ("start", 32, 11, None), + (None, 32, 12, None), + ("end", 32, 13, None), + ), self.get_expected_result(36, 13, "end"), - precreated_shm1_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm1_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -496,12 +603,17 @@ def test_sequence_different_shape_values(self): 1003, (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 16, 111, None), (None, 16, 112, None), - ("end", 16, 113, None)), + ( + ("start", 16, 111, None), + (None, 16, 112, None), + ("end", 16, 113, None), + ), self.get_expected_result(336, 113, "end"), - precreated_shm2_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm2_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -511,12 +623,17 @@ def test_sequence_different_shape_values(self): 1004, (None, None), # (flag_str, shape_value, 
value, pre_delay_ms) - (("start", 1, 1111, None), (None, 1, 1112, None), - ("end", 1, 1113, None)), + ( + ("start", 1, 1111, None), + (None, 1, 1112, None), + ("end", 1, 1113, None), + ), self.get_expected_result(3336, 1113, "end"), - precreated_shm3_handles), - kwargs={'sequence_name': "{}".format(self._testMethodName) - })) + precreated_shm3_handles, + ), + kwargs={"sequence_name": "{}".format(self._testMethodName)}, + ) + ) for t in threads: t.start() @@ -537,12 +654,7 @@ def test_sequence_different_shape_values(self): class DynaSequenceBatcherTest(su.SequenceBatcherTestUtil): - - def get_expected_result(self, - expected_result, - corrid, - value, - flag_str=None): + def get_expected_result(self, expected_result, corrid, value, flag_str=None): expected_result = value if flag_str is not None: if "start" in flag_str: @@ -556,20 +668,23 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): dtype = np.float32 precreated_shm0_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((1, 1), (12, 2), (2, 3)), dtype, 0) + ((1, 1), (12, 2), (2, 3)), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((3, 11), (4, 12), (5, 13)), dtype, 1) + ((3, 11), (4, 12), (5, 13)), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((6, 111), (7, 112), (8, 113)), dtype, 2) + ((6, 111), (7, 112), (8, 113)), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((9, 1111), (10, 1112), (11, 1113)), dtype, 3) + ((9, 1111), (10, 1112), (11, 1113)), dtype, 3 + ) try: model_name = tu.get_dyna_sequence_model_name("plan", dtype) self.check_setup(model_name) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) corrids = [1001, 1002, 1003, 1004] threads = [] @@ -582,17 +697,22 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): corrids[0], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 1, 1, None), (None, 12, 2, None), ("end", 2, - 3, None)), - self.get_expected_result(4 + corrids[0], corrids[0], 3, - "end"), - precreated_shm0_handles), + ( + ("start", 1, 1, None), + (None, 12, 2, None), + ("end", 2, 3, None), + ), + self.get_expected_result(4 + corrids[0], corrids[0], 3, "end"), + precreated_shm0_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[0]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[0] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -602,17 +722,24 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): corrids[1], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 3, 11, None), (None, 4, 12, None), - ("end", 5, 13, None)), - self.get_expected_result(36 + corrids[1], corrids[1], - 13, "end"), - precreated_shm1_handles), + ( + ("start", 3, 11, None), + (None, 4, 12, None), + ("end", 5, 13, None), + ), + self.get_expected_result( + 36 + corrids[1], corrids[1], 13, "end" + ), + precreated_shm1_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[1]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[1] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( 
threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -622,17 +749,24 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): corrids[2], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 6, 111, None), (None, 7, 112, None), - ("end", 8, 113, None)), - self.get_expected_result(336 + corrids[2], corrids[2], - 113, "end"), - precreated_shm2_handles), + ( + ("start", 6, 111, None), + (None, 7, 112, None), + ("end", 8, 113, None), + ), + self.get_expected_result( + 336 + corrids[2], corrids[2], 113, "end" + ), + precreated_shm2_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[2]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[2] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -642,17 +776,24 @@ def _multi_sequence_different_shape_impl(self, sleep_secs): corrids[3], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 9, 1111, None), (None, 10, 1112, None), - ("end", 11, 1113, None)), - self.get_expected_result(3336 + corrids[3], corrids[3], - 1113, "end"), - precreated_shm3_handles), + ( + ("start", 9, 1111, None), + (None, 10, 1112, None), + ("end", 11, 1113, None), + ), + self.get_expected_result( + 3336 + corrids[3], corrids[3], 1113, "end" + ), + precreated_shm3_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[3]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[3] + ), + "using_dynamic_batcher": True, + }, + ) + ) for t in threads: t.start() @@ -676,21 +817,24 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): dtype = np.float32 precreated_shm0_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((2, 1), (4, 2), (8, 3)), dtype, 0) + ((2, 1), (4, 2), (8, 3)), dtype, 0 + ) precreated_shm1_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((2, 11), (4, 12), (8, 13)), dtype, 1) + ((2, 11), (4, 12), (8, 13)), dtype, 1 + ) precreated_shm2_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((2, 111), (4, 112), (8, 113)), dtype, 2) + ((2, 111), (4, 112), (8, 113)), dtype, 2 + ) precreated_shm3_handles = self.precreate_register_dynaseq_shape_tensor_regions( - ((2, 1111), (4, 1112), (8, 1113)), dtype, 3) + ((2, 1111), (4, 1112), (8, 1113)), dtype, 3 + ) try: model_name = tu.get_dyna_sequence_model_name("plan", dtype) self.check_setup(model_name) self.assertFalse("TRITONSERVER_DELAY_SCHEDULER" in os.environ) - self.assertFalse( - "TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) + self.assertFalse("TRITONSERVER_BACKLOG_DELAY_SCHEDULER" in os.environ) corrids = [1001, 1002, 1003, 1004] threads = [] @@ -703,17 +847,22 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): corrids[0], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 1, None), (None, 4, 2, None), ("end", 8, - 3, None)), - self.get_expected_result(4 + corrids[0], corrids[0], 3, - "end"), - precreated_shm0_handles), + ( + ("start", 2, 1, None), + (None, 4, 2, None), + ("end", 8, 3, None), + ), + self.get_expected_result(4 + corrids[0], corrids[0], 3, "end"), + precreated_shm0_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[0]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[0] + ), + 
"using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -723,17 +872,24 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): corrids[1], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 11, None), (None, 4, 12, None), - ("end", 8, 13, None)), - self.get_expected_result(36 + corrids[1], corrids[1], - 13, "end"), - precreated_shm1_handles), + ( + ("start", 2, 11, None), + (None, 4, 12, None), + ("end", 8, 13, None), + ), + self.get_expected_result( + 36 + corrids[1], corrids[1], 13, "end" + ), + precreated_shm1_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[1]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[1] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -743,17 +899,24 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): corrids[2], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 111, None), (None, 4, 112, None), - ("end", 8, 113, None)), - self.get_expected_result(336 + corrids[2], corrids[2], - 113, "end"), - precreated_shm2_handles), + ( + ("start", 2, 111, None), + (None, 4, 112, None), + ("end", 8, 113, None), + ), + self.get_expected_result( + 336 + corrids[2], corrids[2], 113, "end" + ), + precreated_shm2_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[2]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[2] + ), + "using_dynamic_batcher": True, + }, + ) + ) threads.append( threading.Thread( target=self.check_sequence_shape_tensor_io, @@ -763,17 +926,24 @@ def _multi_sequence_identical_shape_impl(self, sleep_secs): corrids[3], (None, None), # (flag_str, shape_value, value, pre_delay_ms) - (("start", 2, 1111, None), (None, 4, 1112, None), - ("end", 8, 1113, None)), - self.get_expected_result(3336 + corrids[3], corrids[3], - 1113, "end"), - precreated_shm3_handles), + ( + ("start", 2, 1111, None), + (None, 4, 1112, None), + ("end", 8, 1113, None), + ), + self.get_expected_result( + 3336 + corrids[3], corrids[3], 1113, "end" + ), + precreated_shm3_handles, + ), kwargs={ - 'sequence_name': - "{}_{}".format(self._testMethodName, corrids[3]), - 'using_dynamic_batcher': - True - })) + "sequence_name": "{}_{}".format( + self._testMethodName, corrids[3] + ), + "using_dynamic_batcher": True, + }, + ) + ) for t in threads: t.start() @@ -815,5 +985,5 @@ def test_dynaseq_different_shape_values_parallel(self): self._multi_sequence_different_shape_impl(0) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_vertex_ai/test.sh b/qa/L0_vertex_ai/test.sh old mode 100644 new mode 100755 index d334d6c886..3113a66d1f --- a/qa/L0_vertex_ai/test.sh +++ b/qa/L0_vertex_ai/test.sh @@ -106,7 +106,7 @@ function vertex_ai_wait_for_server_ready() { WAIT_RET=1 } -# Helper function to unset all AIP vairables before test +# Helper function to unset all AIP variables before test function unset_vertex_variables() { unset AIP_MODE unset AIP_HTTP_PORT @@ -418,7 +418,7 @@ else fi fi -# Test AIP_STORAGE_URI won't be used if model repository is specified +# Test AIP_STORAGE_URI won't be used if model repository is specified SERVER_ARGS="--model-repository=single_model" run_server_nowait vertex_ai_wait_for_server_ready $SERVER_PID 10 diff --git 
a/qa/L0_vertex_ai/vertex_ai_test.py b/qa/L0_vertex_ai/vertex_ai_test.py old mode 100644 new mode 100755 index c421987538..b6f9fc42b4 --- a/qa/L0_vertex_ai/vertex_ai_test.py +++ b/qa/L0_vertex_ai/vertex_ai_test.py @@ -1,5 +1,5 @@ #!/usr/bin/python -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -26,44 +26,34 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import sys + sys.path.append("../common") import os -import shutil -import time +import sys import unittest + import numpy as np -import infer_util as iu +import requests import test_util as tu import tritonclient.http as httpclient -import argparse -import csv -import json -import os -import requests -import socket -import sys - class VertexAiTest(tu.TestResultCollector): - def setUp(self): - port = os.getenv('AIP_HTTP_PORT', '8080') - predict_endpoint = os.getenv('AIP_PREDICT_ROUTE', '/predict') - self.model_ = os.getenv('TEST_EXPLICIT_MODEL_NAME', 'addsub') + port = os.getenv("AIP_HTTP_PORT", "8080") + predict_endpoint = os.getenv("AIP_PREDICT_ROUTE", "/predict") + self.model_ = os.getenv("TEST_EXPLICIT_MODEL_NAME", "addsub") self.url_ = "http://localhost:{}{}".format(port, predict_endpoint) - self.input_data_ = [ - 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 - ] + self.input_data_ = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] self.expected_output0_data_ = [x * 2 for x in self.input_data_] self.expected_output1_data_ = [0 for x in self.input_data_] def test_predict(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -71,22 +61,20 @@ def test_predict(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - result = httpclient.InferenceServerClient.parse_response_body( - r._content) + result = httpclient.InferenceServerClient.parse_response_body(r._content) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -94,8 +82,8 @@ def test_predict(self): def test_predict_specified_model(self): inputs = [] outputs = [] - 
inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -103,27 +91,23 @@ def test_predict_specified_model(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/json', - "X-Vertex-Ai-Triton-Redirect": - "v2/models/{}/infer".format(self.model_) + "Content-Type": "application/json", + "X-Vertex-Ai-Triton-Redirect": "v2/models/{}/infer".format(self.model_), } r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - result = httpclient.InferenceServerClient.parse_response_body( - r._content) + result = httpclient.InferenceServerClient.parse_response_body(r._content) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") if self.model_ == "addsub": expected_output0_data = [x * 2 for x in self.input_data_] expected_output1_data = [0 for x in self.input_data_] @@ -137,8 +121,8 @@ def test_predict_specified_model(self): def test_predict_request_binary(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -146,25 +130,26 @@ def test_predict_request_binary(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.vertex-ai-triton.binary+json;json-header-size={}' - .format(header_length) + "Content-Type": "application/vnd.vertex-ai-triton.binary+json;json-header-size={}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - result = httpclient.InferenceServerClient.parse_response_body( - r._content) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + result = 
httpclient.InferenceServerClient.parse_response_body(r._content) + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -172,8 +157,8 @@ def test_predict_request_binary(self): def test_predict_response_binary(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -181,23 +166,23 @@ def test_predict_response_binary(self): inputs[0].set_data_from_numpy(input_data, binary_data=False) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=True)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=True)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) request_body, _ = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + inputs, outputs=outputs + ) - headers = {'Content-Type': 'application/json'} + headers = {"Content-Type": "application/json"} r = requests.post(self.url_, data=request_body, headers=headers) r.raise_for_status() - header_length_str = r.headers['Inference-Header-Content-Length'] + header_length_str = r.headers["Inference-Header-Content-Length"] result = httpclient.InferenceServerClient.parse_response_body( - r._content, header_length=int(header_length_str)) + r._content, header_length=int(header_length_str) + ) - output0_data = result.as_numpy('OUTPUT0') - output1_data = result.as_numpy('OUTPUT1') + output0_data = result.as_numpy("OUTPUT0") + output1_data = result.as_numpy("OUTPUT1") for i in range(16): self.assertEqual(output0_data[0][i], self.expected_output0_data_[i]) self.assertEqual(output1_data[0][i], self.expected_output1_data_[i]) @@ -205,8 +190,8 @@ def test_predict_response_binary(self): def test_malformed_binary_header(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -214,29 +199,34 @@ def test_malformed_binary_header(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 
'additional-string/application/vnd.vertex-ai-triton.binary+json;json-header-size={}' - .format(header_length) + "Content-Type": "additional-string/application/vnd.vertex-ai-triton.binary+json;json-header-size={}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_not_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -244,29 +234,34 @@ def test_malformed_binary_header_not_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.vertex-ai-triton.binary+json;json-header-size=additional-string{}' - .format(header_length) + "Content-Type": "application/vnd.vertex-ai-triton.binary+json;json-header-size=additional-string{}".format( + header_length + ) } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_negative_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -274,28 +269,32 @@ def test_malformed_binary_header_negative_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.vertex-ai-triton.binary+json;json-header-size=-123' + "Content-Type": 
"application/vnd.vertex-ai-triton.binary+json;json-header-size=-123" } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) def test_malformed_binary_header_large_number(self): inputs = [] outputs = [] - inputs.append(httpclient.InferInput('INPUT0', [1, 16], "INT32")) - inputs.append(httpclient.InferInput('INPUT1', [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT0", [1, 16], "INT32")) + inputs.append(httpclient.InferInput("INPUT1", [1, 16], "INT32")) # Initialize the data input_data = np.array(self.input_data_, dtype=np.int32) @@ -303,23 +302,27 @@ def test_malformed_binary_header_large_number(self): inputs[0].set_data_from_numpy(input_data, binary_data=True) inputs[1].set_data_from_numpy(input_data, binary_data=False) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT0', binary_data=False)) - outputs.append( - httpclient.InferRequestedOutput('OUTPUT1', binary_data=False)) - request_body, header_length = httpclient.InferenceServerClient.generate_request_body( - inputs, outputs=outputs) + outputs.append(httpclient.InferRequestedOutput("OUTPUT0", binary_data=False)) + outputs.append(httpclient.InferRequestedOutput("OUTPUT1", binary_data=False)) + ( + request_body, + header_length, + ) = httpclient.InferenceServerClient.generate_request_body( + inputs, outputs=outputs + ) headers = { - 'Content-Type': - 'application/vnd.vertex-ai-triton.binary+json;json-header-size=12345' + "Content-Type": "application/vnd.vertex-ai-triton.binary+json;json-header-size=12345" } r = requests.post(self.url_, data=request_body, headers=headers) self.assertEqual( - 400, r.status_code, + 400, + r.status_code, "Expected error code {} returned for the request; got: {}".format( - 400, r.status_code)) + 400, r.status_code + ), + ) -if __name__ == '__main__': +if __name__ == "__main__": unittest.main() diff --git a/qa/L0_warmup/decoupled/1/model.py b/qa/L0_warmup/decoupled/1/model.py index db7c6903f5..9827a87f09 100644 --- a/qa/L0_warmup/decoupled/1/model.py +++ b/qa/L0_warmup/decoupled/1/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,11 +28,12 @@ class TritonPythonModel: - """Test model that always returns 0 response for all requests. """ + """Test model that always returns 0 response for all requests.""" def execute(self, requests): for request in requests: request.get_response_sender().send( - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) return None diff --git a/qa/L0_warmup/failing_infer/1/model.py b/qa/L0_warmup/failing_infer/1/model.py index 1935fe6cd9..632477c903 100644 --- a/qa/L0_warmup/failing_infer/1/model.py +++ b/qa/L0_warmup/failing_infer/1/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,7 @@ class TritonPythonModel: - """Test model that always returns error for all requests. 
""" + """Test model that always returns error for all requests.""" def execute(self, requests): responses = [] @@ -36,8 +36,9 @@ def execute(self, requests): for _ in requests: responses.append( pb_utils.InferenceResponse( - output_tensors=[], - error=pb_utils.TritonError("An Error Occurred"))) + output_tensors=[], error=pb_utils.TritonError("An Error Occurred") + ) + ) # You must return a list of pb_utils.InferenceResponse. Length # of this list must match the length of `requests` list. diff --git a/qa/L0_warmup/test.sh b/qa/L0_warmup/test.sh old mode 100644 new mode 100755 index aad83e1789..193f4b130d --- a/qa/L0_warmup/test.sh +++ b/qa/L0_warmup/test.sh @@ -1,5 +1,5 @@ #!/bin/bash -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -42,6 +42,9 @@ export CUDA_VISIBLE_DEVICES=0 CLIENT=../clients/image_client CLIENT_LOG="./client.log" +CLIENT_PY=./python_unittest.py +EXPECTED_NUM_TESTS="1" +TEST_RESULT_FILE='test_results.txt' IMAGE="../images/vulture.jpeg" @@ -56,6 +59,7 @@ SERVER_LOG="./inference_server.log" source ../common/util.sh RET=0 +rm -fr *.txt for BACKEND in ${BACKENDS}; do rm -f $SERVER_LOG $CLIENT_LOG @@ -408,8 +412,83 @@ set -e kill $SERVER_PID wait $SERVER_PID -if [ $RET -eq 0 ]; then - echo -e "\n***\n*** Test Passed\n***" +# Test the onnx model to verify that the memory type of the output tensor +# remains unchanged with the warmup setting +pip3 uninstall -y torch +pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html + +rm -fr models && mkdir models +cp -r /data/inferenceserver/${REPO_VERSION}/qa_model_repository/onnx_nobatch_float32_float32_float32 models/. +(cd models/onnx_nobatch_float32_float32_float32 && \ + echo "" >> config.pbtxt && \ + echo 'instance_group [{' >> config.pbtxt && \ + echo ' kind : KIND_GPU' >> config.pbtxt && \ + echo '}]' >> config.pbtxt && \ + echo 'model_warmup [{' >> config.pbtxt && \ + echo ' name : "sample"' >> config.pbtxt && \ + echo ' batch_size: 1' >> config.pbtxt && \ + echo ' inputs {' >> config.pbtxt && \ + echo ' key: "INPUT0"' >> config.pbtxt && \ + echo ' value: {' >> config.pbtxt && \ + echo ' data_type: TYPE_FP32' >> config.pbtxt && \ + echo " dims: 16" >> config.pbtxt && \ + echo " zero_data: false" >> config.pbtxt && \ + echo ' }' >> config.pbtxt && \ + echo ' }' >> config.pbtxt && \ + echo ' inputs {' >> config.pbtxt && \ + echo ' key: "INPUT1"' >> config.pbtxt && \ + echo ' value: {' >> config.pbtxt && \ + echo ' data_type: TYPE_FP32' >> config.pbtxt && \ + echo " dims: 16" >> config.pbtxt && \ + echo " zero_data: false" >> config.pbtxt && \ + echo ' }' >> config.pbtxt && \ + echo ' }' >> config.pbtxt && \ + echo '}]' >> config.pbtxt ) + +mkdir -p models/bls_onnx_warmup/1/ +cp ../python_models/bls_onnx_warmup/model.py models/bls_onnx_warmup/1/ +cp ../python_models/bls_onnx_warmup/config.pbtxt models/bls_onnx_warmup/. + +cp ../L0_backend_python/python_unittest.py . +sed -i 's#sys.path.append("../../common")#sys.path.append("../common")#g' python_unittest.py + +run_server +if [ "$SERVER_PID" == "0" ]; then + echo -e "\n***\n*** Failed to start $SERVER\n***" + cat $SERVER_LOG + exit 1 +fi + +set +e + +export MODEL_NAME='bls_onnx_warmup' +python3 $CLIENT_PY >> $CLIENT_LOG 2>&1 +if [ $? -ne 0 ]; then + echo -e "\n***\n*** 'bls_onnx_warmup' test FAILED. 
\n***" + cat $CLIENT_LOG + RET=1 +else + check_test_results $TEST_RESULT_FILE $EXPECTED_NUM_TESTS + if [ $? -ne 0 ]; then + cat $CLIENT_LOG + echo -e "\n***\n*** Test Result Verification Failed\n***" + RET=1 + fi +fi + +set -e + + +kill $SERVER_PID +wait $SERVER_PID + + +if [ $RET -eq 1 ]; then + cat $CLIENT_LOG + cat $SERVER_LOG + echo -e "\n***\n*** Test Failed \n***" +else + echo -e "\n***\n*** Test Passed \n***" fi exit $RET diff --git a/qa/common/check_copyright.py b/qa/common/check_copyright.py index 7d6e8e0729..ff18ca8e39 100755 --- a/qa/common/check_copyright.py +++ b/qa/common/check_copyright.py @@ -1,6 +1,6 @@ #!/usr/bin/env python3 -# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,37 +28,68 @@ import argparse import os -import re import pathlib +import re FLAGS = None -SKIP_EXTS = ('jpeg', 'jpg', 'pgm', 'png', 'log', 'serverlog', 'preprocessed', - 'jmx', 'gz', 'json', 'pdf', 'so', 'onnx') -REPO_PATH_FROM_THIS_FILE = '../..' +SKIP_EXTS = ( + "jpeg", + "jpg", + "pgm", + "png", + "log", + "preprocessed", + "jmx", + "gz", + "json", + "pdf", + "so", + "onnx", + "svg", +) +REPO_PATH_FROM_THIS_FILE = "../.." SKIP_PATHS = ( - 'build', - 'deploy/gke-marketplace-app/.gitignore', - 'deploy/gke-marketplace-app/server-deployer/chart/.helmignore', - 'deploy/gcp/.helmignore', 'deploy/aws/.helmignore', - 'deploy/fleetcommand/.helmignore', 'docs/examples/model_repository', - 'docs/examples/jetson', 'docker', - 'qa/common/cuda_op_kernel.cu.cc.patch', - 'qa/ensemble_models/mix_platform_float32_float32_float32/output0_labels.txt', - 'qa/ensemble_models/mix_type_int32_float32_float32/output0_labels.txt', - 'qa/ensemble_models/mix_ensemble_int32_float32_float32/output0_labels.txt', - 'qa/ensemble_models/wrong_label_int32_float32_float32/output0_labels.txt', - 'qa/ensemble_models/label_override_int32_float32_float32/output0_labels.txt', - 'qa/L0_model_config/noautofill_platform', - 'qa/L0_model_config/autofill_noplatform', - 'qa/L0_model_config/autofill_noplatform_success', - 'qa/L0_model_config/special_cases', 'qa/L0_perf_nomodel/baseline', - 'qa/L0_perf_nomodel/legacy_baseline', 'qa/L0_warmup/raw_mug_data', - 'qa/L0_java_resnet/expected_output_data', - 'TRITON_VERSION') + "build", + "deploy/gke-marketplace-app/.gitignore", + "deploy/gke-marketplace-app/server-deployer/chart/.helmignore", + "deploy/gcp/.helmignore", + "deploy/aws/.helmignore", + "deploy/fleetcommand/.helmignore", + "docs/.gitignore", + "docs/_static/.gitattributes", + "docs/examples/model_repository", + "docs/examples/jetson", + "docker", + "qa/common/cuda_op_kernel.cu.cc.patch", + "qa/ensemble_models/mix_platform_float32_float32_float32/output0_labels.txt", + "qa/ensemble_models/mix_type_int32_float32_float32/output0_labels.txt", + "qa/ensemble_models/mix_ensemble_int32_float32_float32/output0_labels.txt", + "qa/ensemble_models/wrong_label_int32_float32_float32/output0_labels.txt", + "qa/ensemble_models/label_override_int32_float32_float32/output0_labels.txt", + "qa/L0_model_config/noautofill_platform", + "qa/L0_model_config/autofill_noplatform", + "qa/L0_model_config/autofill_noplatform_success", + "qa/L0_model_config/special_cases", + "qa/L0_model_config/cli_messages/cli_override/expected", + "qa/L0_model_config/cli_messages/cli_deprecation/expected", + 
"qa/L0_model_namespacing/test_duplication", + "qa/L0_model_namespacing/test_dynamic_resolution", + "qa/L0_model_namespacing/test_ensemble_duplication", + "qa/L0_model_namespacing/test_no_duplication", + "qa/L0_perf_nomodel/baseline", + "qa/L0_perf_nomodel/legacy_baseline", + "qa/L0_warmup/raw_mug_data", + "qa/L0_java_resnet/expected_output_data", + "qa/L0_trt_dla_jetson/trt_dla_model_store", + "qa/openvino_models/dynamic_batch", + "qa/openvino_models/fixed_batch", + "CITATION.cff", + "TRITON_VERSION", +) -COPYRIGHT_YEAR_RE = 'Copyright( \\(c\\))? 20[1-9][0-9](-(20)?[1-9][0-9])?(,((20[2-9][0-9](-(20)?[2-9][0-9])?)|([2-9][0-9](-[2-9][0-9])?)))*,? NVIDIA CORPORATION( & AFFILIATES)?. All rights reserved.' +COPYRIGHT_YEAR_RE = "Copyright( \\(c\\))? 20[1-9][0-9](-(20)?[1-9][0-9])?(,((20[2-9][0-9](-(20)?[2-9][0-9])?)|([2-9][0-9](-[2-9][0-9])?)))*,? NVIDIA CORPORATION( & AFFILIATES)?. All rights reserved." -COPYRIGHT = ''' +COPYRIGHT = """ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions @@ -83,10 +114,11 @@ OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -''' +""" -repo_abs_path = pathlib.Path(__file__).parent.joinpath( - REPO_PATH_FROM_THIS_FILE).resolve() +repo_abs_path = ( + pathlib.Path(__file__).parent.joinpath(REPO_PATH_FROM_THIS_FILE).resolve() +) copyright_year_re = re.compile(COPYRIGHT_YEAR_RE) @@ -96,32 +128,37 @@ def visit(path): print("visiting " + path) for skip in SKIP_EXTS: - if path.endswith('.' + skip): + if path.endswith("." + skip): if FLAGS.verbose: print("skipping due to extension: " + path) return True for skip in SKIP_PATHS: if str(pathlib.Path(path).resolve()).startswith( - str(repo_abs_path.joinpath(skip).resolve())): + str(repo_abs_path.joinpath(skip).resolve()) + ): if FLAGS.verbose: print("skipping due to path prefix: " + path) return True - with open(path, 'r') as f: + with open(path, "r") as f: first_line = True line = None try: for fline in f: line = fline - # Skip any '#!', '..', ' + +The models in this directory are TF2/keras models converted into OpenVINO +models. The "fixed_batch" model has a fixed batch dimension of 1 and the +"dynamic_batch" model has a variable batch dimension. + +The models are currently in **beta**, which they might not work as expected and +could be **changed, moved or deleted without warning** in the future. 
diff --git a/qa/openvino_models/dynamic_batch/1/model.bin b/qa/openvino_models/dynamic_batch/1/model.bin new file mode 100644 index 0000000000..e69de29bb2
diff --git a/qa/openvino_models/dynamic_batch/1/model.mapping b/qa/openvino_models/dynamic_batch/1/model.mapping new file mode 100644 index 0000000000..4705831777 --- /dev/null +++ b/qa/openvino_models/dynamic_batch/1/model.mapping @@ -0,0 +1,195 @@ [195 added lines of OpenVINO IR mapping XML; markup lost during text extraction]
diff --git a/qa/openvino_models/dynamic_batch/1/model.xml b/qa/openvino_models/dynamic_batch/1/model.xml new file mode 100644 index 0000000000..59594953c6 --- /dev/null +++ b/qa/openvino_models/dynamic_batch/1/model.xml @@ -0,0 +1,166 @@ [166 added lines of OpenVINO IR XML; markup lost during text extraction, only tensor dims (1, 4) remain]
diff --git a/qa/openvino_models/fixed_batch/1/model.bin b/qa/openvino_models/fixed_batch/1/model.bin new file mode 100644 index 0000000000..e69de29bb2
diff --git a/qa/openvino_models/fixed_batch/1/model.mapping b/qa/openvino_models/fixed_batch/1/model.mapping new file mode 100644 index 0000000000..bd1a4eccb8 --- /dev/null +++ b/qa/openvino_models/fixed_batch/1/model.mapping @@ -0,0 +1,211 @@ [211 added lines of OpenVINO IR mapping XML; markup lost during text extraction]
diff --git a/qa/openvino_models/fixed_batch/1/model.xml b/qa/openvino_models/fixed_batch/1/model.xml new file mode 100644 index 0000000000..e0f8954866 --- /dev/null +++ b/qa/openvino_models/fixed_batch/1/model.xml @@ -0,0 +1,152 @@ [152 added lines of OpenVINO IR XML; markup lost during text extraction, only tensor dims (1, 4) remain]
diff --git a/qa/python_models/add_sub/config.pbtxt b/qa/python_models/add_sub/config.pbtxt index b0805c0089..39bd6771d0 100644 --- a/qa/python_models/add_sub/config.pbtxt +++ b/qa/python_models/add_sub/config.pbtxt @@ -24,7 +24,6 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -name: "add_sub" backend: "python" input [
diff --git a/qa/python_models/add_sub/model.py b/qa/python_models/add_sub/model.py index 4aac895e1c..0868014804 100644 --- a/qa/python_models/add_sub/model.py +++ b/qa/python_models/add_sub/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. 
All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,29 +24,28 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + +import numpy as np import triton_python_backend_utils as pb_utils class TritonPythonModel: - def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" output0_dtype = self.output0_dtype output1_dtype = self.output1_dtype @@ -55,18 +54,21 @@ def execute(self, requests): for request in requests: in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + ) else: - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/add_sub_gpu/config.pbtxt b/qa/python_models/add_sub_gpu/config.pbtxt index 79154871c2..dd4a3ebecf 100644 --- a/qa/python_models/add_sub_gpu/config.pbtxt +++ b/qa/python_models/add_sub_gpu/config.pbtxt @@ -32,7 +32,7 @@ input [ name: "INPUT0" data_type: TYPE_FP32 dims: [ 4 ] - + } ] input [ @@ -40,7 +40,7 @@ input [ name: "INPUT1" data_type: TYPE_FP32 dims: [ 4 ] - + } ] output [ @@ -55,8 +55,8 @@ output [ name: "OUTPUT1" data_type: TYPE_FP32 dims: [ 4 ] - - + + } ] diff --git a/qa/python_models/auto_complete/model.py b/qa/python_models/auto_complete/model.py index c4768a562e..7f67182387 100644 --- a/qa/python_models/auto_complete/model.py +++ b/qa/python_models/auto_complete/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. 
All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,19 +24,19 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + +import numpy as np import triton_python_backend_utils as pb_utils class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) @@ -47,21 +47,20 @@ def auto_complete_config(auto_complete_model_config): return auto_complete_model_config def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ This function is called on inference request. 
- """ + """This function is called on inference request.""" output0_dtype = self.output0_dtype output1_dtype = self.output1_dtype @@ -70,18 +69,21 @@ def execute(self, requests): for request in requests: in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + ) else: - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/auto_complete_error/model.py b/qa/python_models/auto_complete_error/model.py index b45a8f1149..1d611c36d5 100644 --- a/qa/python_models/auto_complete_error/model.py +++ b/qa/python_models/auto_complete_error/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,13 +24,8 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import json -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): """ @@ -38,10 +33,10 @@ def auto_complete_config(auto_complete_model_config): to test correct handling of Python errors in the `auto_complete_config` function. """ - input0 = {'name': 'INPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - input1 = {'name': 'INPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} - output0 = {'name': 'OUTPUT0', 'data_type': 'TYPE_FP32', 'dims': [4]} - output1 = {'name': 'OUTPUT1', 'data_type': 'TYPE_FP32', 'dims': [4]} + input0 = {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]} + input1 = {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]} + output0 = {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [4]} + output1 = {"name": "OUTPUT1", "data_type": "TYPE_FP32", "dims": [4]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input0) diff --git a/qa/python_models/bls/model.py b/qa/python_models/bls/model.py index 894aa9a09a..30bba29a70 100644 --- a/qa/python_models/bls/model.py +++ b/qa/python_models/bls/model.py @@ -1,4 +1,4 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. 
All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,14 +24,17 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np +import gc +import os +import sys +import threading import unittest -import triton_python_backend_utils as pb_utils +from multiprocessing import Pool + +import numpy as np import torch +import triton_python_backend_utils as pb_utils from torch.utils.dlpack import from_dlpack, to_dlpack -import threading -from multiprocessing import Pool -import sys _deferred_exceptions_lock = threading.Lock() _deferred_exceptions = [] @@ -42,18 +45,19 @@ def bls_add_sub(_=None): input0_np = input0_np.astype(np.float32) input1_np = np.random.randn(*[16]) input1_np = input1_np.astype(np.float32) - input0 = pb_utils.Tensor('INPUT0', input0_np) - input1 = pb_utils.Tensor('INPUT1', input1_np) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", input1_np) infer_request = pb_utils.InferenceRequest( - model_name='add_sub', + model_name="add_sub", inputs=[input0, input1], - requested_output_names=['OUTPUT0', 'OUTPUT1']) + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) infer_response = infer_request.exec() if infer_response.has_error(): return False - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') - output1 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT1') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") if output0 is None or output1 is None: return False @@ -69,7 +73,97 @@ def bls_add_sub(_=None): return True +def bls_square(_=None): + input0_np = np.random.randint(16, size=1, dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", inputs=[input0], requested_output_names=["OUT"] + ) + infer_responses = infer_request.exec(decoupled=True) + + response_count = 0 + + if infer_responses: + for infer_response in infer_responses: + if infer_response.has_error(): + return False + + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + if output0 is None: + return False + + expected_output = input0.as_numpy() + + if not np.all(expected_output == output0.as_numpy()): + return False + + response_count += 1 + + if not np.all(input0.as_numpy() == response_count - 1): + return False + + return True + + +def bls_libtorch(model_name, result_device): + shape = [16] + input0_np = np.random.rand(*shape).astype(np.float32) + input1_np = np.random.rand(*shape).astype(np.float32) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", input1_np) + + if result_device == "CPU": + preferred_memory = pb_utils.PreferredMemory(pb_utils.TRITONSERVER_MEMORY_CPU) + else: + preferred_memory = pb_utils.PreferredMemory(pb_utils.TRITONSERVER_MEMORY_GPU, 0) + + infer_request = pb_utils.InferenceRequest( + model_name=model_name, + model_version=1, + inputs=[input0, input1], + requested_output_names=["OUTPUT__0", "OUTPUT__1"], + preferred_memory=preferred_memory, + ) + + infer_response = infer_request.exec() + if infer_response.has_error(): + return False + + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT__0") + output1 = 
pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT__1") + if output0 is None or output1 is None: + return False + + expected_output_0 = input0.as_numpy() + input1.as_numpy() + expected_output_1 = input0.as_numpy() - input1.as_numpy() + + if result_device == "CPU": + if not output0.is_cpu() or not output1.is_cpu(): + return False + + if not np.all(expected_output_0 == output0.as_numpy()): + return False + + if not np.all(expected_output_1 == output1.as_numpy()): + return False + else: + if output0.is_cpu() or output1.is_cpu(): + return False + output0 = from_dlpack(output0.to_dlpack()).to("cpu").cpu().detach().numpy() + output1 = from_dlpack(output1.to_dlpack()).to("cpu").cpu().detach().numpy() + + if not np.all(output0 == expected_output_0): + return False + if not np.all(output1 == expected_output_1): + return False + + return True + + class PBBLSTest(unittest.TestCase): + def setUp(self): + self._is_decoupled = True if os.environ["BLS_KIND"] == "decoupled" else False def add_deferred_exception(self, ex): global _deferred_exceptions @@ -82,84 +176,132 @@ def check_deferred_exception(self): raise _deferred_exceptions[0] def test_bls_wrong_inputs(self): - input0 = pb_utils.Tensor('INPUT0', np.random.randn(*[1, 16])) + input0 = pb_utils.Tensor("INPUT0", np.random.randn(*[1, 16])) - infer_request = pb_utils.InferenceRequest( - model_name='add_sub', - inputs=[input0], - requested_output_names=['OUTPUT0', 'OUTPUT1']) - infer_response = infer_request.exec() - self.assertTrue(infer_response.has_error()) - self.assertEqual( - infer_response.error().message(), - "expected 2 inputs but got 1 inputs for model 'add_sub'") - self.assertTrue(len(infer_response.output_tensors()) == 0) + if self._is_decoupled: + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", inputs=[], requested_output_names=["OUT"] + ) + infer_responses = infer_request.exec(decoupled=True) + for infer_response in infer_responses: + self.assertTrue(infer_response.has_error()) + self.assertIn( + "expected 1 inputs but got 0 inputs for model 'square_int32'", + infer_response.error().message(), + ) + self.assertTrue(len(infer_response.output_tensors()) == 0) + else: + infer_request = pb_utils.InferenceRequest( + model_name="add_sub", + inputs=[input0], + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) + infer_response = infer_request.exec() + self.assertTrue(infer_response.has_error()) + self.assertIn( + "expected 2 inputs but got 1 inputs for model 'add_sub'", + infer_response.error().message(), + ) + self.assertTrue(len(infer_response.output_tensors()) == 0) - def _send_bls_sequence_requests(self, correlation_id): + def _send_bls_sequence_requests(self, correlation_id, is_decoupled): # Start request try: - input = pb_utils.Tensor('INPUT', np.array([1000], dtype=np.int32)) + input = pb_utils.Tensor("INPUT", np.array([1000], dtype=np.int32)) infer_request = pb_utils.InferenceRequest( - model_name='onnx_nobatch_sequence_int32', + model_name="onnx_nobatch_sequence_int32", inputs=[input], - requested_output_names=['OUTPUT'], + requested_output_names=["OUTPUT"], flags=pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START, - correlation_id=correlation_id) - self.assertTrue(infer_request.flags(), - pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START) + correlation_id=correlation_id, + ) + self.assertTrue( + infer_request.flags(), pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START + ) infer_response = infer_request.exec() self.assertFalse(infer_response.has_error()) - output = 
pb_utils.get_output_tensor_by_name(infer_response, - 'OUTPUT') - self.assertEqual(output.as_numpy()[0], input.as_numpy()[0]) + output = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT") + self.assertFalse(output.is_cpu()) + output = from_dlpack(output.to_dlpack()).to("cpu").cpu().detach().numpy() + self.assertEqual(output[0], input.as_numpy()[0]) for i in range(10): - input = pb_utils.Tensor('INPUT', np.array([i], dtype=np.int32)) + input = pb_utils.Tensor("INPUT", np.array([i], dtype=np.int32)) infer_request = pb_utils.InferenceRequest( - model_name='onnx_nobatch_sequence_int32', + model_name="onnx_nobatch_sequence_int32", inputs=[input], - requested_output_names=['OUTPUT'], - correlation_id=correlation_id) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT"], + correlation_id=correlation_id, + ) + + if is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() self.assertFalse(infer_response.has_error()) # The new output is the previous output + the current input - expected_output = output.as_numpy()[0] + i - output = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT') - self.assertEqual(output.as_numpy()[0], expected_output) + expected_output = output[0] + i + output = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT") + self.assertFalse(output.is_cpu()) + output = ( + from_dlpack(output.to_dlpack()).to("cpu").cpu().detach().numpy() + ) + self.assertEqual(output[0], expected_output) # Final request - input = pb_utils.Tensor('INPUT', np.array([2000], dtype=np.int32)) + input = pb_utils.Tensor("INPUT", np.array([2000], dtype=np.int32)) infer_request = pb_utils.InferenceRequest( - model_name='onnx_nobatch_sequence_int32', + model_name="onnx_nobatch_sequence_int32", inputs=[input], - requested_output_names=['OUTPUT'], - correlation_id=correlation_id) - infer_request.set_flags( - pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END) - self.assertTrue(infer_request.flags(), - pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END) + requested_output_names=["OUTPUT"], + correlation_id=correlation_id, + ) + infer_request.set_flags(pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END) + self.assertTrue( + infer_request.flags(), pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_END + ) + + if is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() - infer_response = infer_request.exec() self.assertFalse(infer_response.has_error()) - expected_output = output.as_numpy()[0] + input.as_numpy()[0] - output = pb_utils.get_output_tensor_by_name(infer_response, - 'OUTPUT') - self.assertEqual(output.as_numpy()[0], expected_output) + expected_output = output[0] + input.as_numpy()[0] + output = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT") + self.assertFalse(output.is_cpu()) + output = from_dlpack(output.to_dlpack()).to("cpu").cpu().detach().numpy() + self.assertEqual(output[0], expected_output) except Exception as e: self.add_deferred_exception(e) def test_bls_sequence(self): # Send 2 sequence of BLS requests simultaneously and check the responses. 
threads = [] - thread1 = threading.Thread(target=self._send_bls_sequence_requests, - args=(1000,)) + thread1 = threading.Thread( + target=self._send_bls_sequence_requests, + args=( + 1000, + self._is_decoupled, + ), + ) threads.append(thread1) - thread2 = threading.Thread(target=self._send_bls_sequence_requests, - args=(1001,)) + thread2 = threading.Thread( + target=self._send_bls_sequence_requests, + args=( + 1001, + self._is_decoupled, + ), + ) threads.append(thread2) for thread in threads: @@ -174,30 +316,39 @@ def test_bls_sequence(self): def test_bls_incorrect_args(self): with self.assertRaises(TypeError): pb_utils.InferenceRequest( - inputs=[], requested_output_names=['OUTPUT0', 'OUTPUT1']) + inputs=[], requested_output_names=["OUTPUT0", "OUTPUT1"] + ) with self.assertRaises(TypeError): pb_utils.InferenceRequest( - model_name='add_sub', - requested_output_names=['OUTPUT0', 'OUTPUT1']) + model_name="add_sub", requested_output_names=["OUTPUT0", "OUTPUT1"] + ) with self.assertRaises(TypeError): - pb_utils.InferenceRequest(model_name='add_sub', inputs=[]) + pb_utils.InferenceRequest(model_name="add_sub", inputs=[]) - def _get_gpu_bls_outputs(self, input0_pb, input1_pb): + def _get_gpu_bls_outputs(self, input0_pb, input1_pb, is_decoupled): """ This function is created to test that the DLPack container works properly when the inference response and outputs go out of scope. """ infer_request = pb_utils.InferenceRequest( - model_name='dlpack_add_sub', + model_name="dlpack_add_sub", inputs=[input0_pb, input1_pb], - requested_output_names=['OUTPUT0', 'OUTPUT1']) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) + if is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() + self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') - output1 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT1') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") self.assertIsNotNone(output0) self.assertIsNotNone(output1) @@ -227,178 +378,435 @@ def _get_gpu_bls_outputs(self, input0_pb, input1_pb): output1_dlpack = None rc_after_del_dlpack_output0 = sys.getrefcount(output0) rc_after_del_dlpack_output1 = sys.getrefcount(output1) - self.assertEqual(rc_after_del_dlpack_output0 - rc_after_dlpack_output0, - -1) - self.assertEqual(rc_after_del_dlpack_output1 - rc_after_dlpack_output1, - -1) + self.assertEqual(rc_after_del_dlpack_output0 - rc_after_dlpack_output0, -1) + self.assertEqual(rc_after_del_dlpack_output1 - rc_after_dlpack_output1, -1) return output0.to_dlpack(), output1.to_dlpack() def test_zero_length_io(self): - model_name = 'identity_fp32' + model_name = "identity_fp32" input0 = np.zeros([1, 0], dtype=np.float32) - input0_pb = pb_utils.Tensor('INPUT0', input0) + input0_pb = pb_utils.Tensor("INPUT0", input0) infer_request = pb_utils.InferenceRequest( model_name=model_name, inputs=[input0_pb], - requested_output_names=['OUTPUT0']) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT0"], + ) + + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = 
infer_request.exec() + self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") self.assertTrue(np.all(output0 == input0)) - def test_bls_tensor_lifecycle(self): - model_name = 'dlpack_identity' + def cuda_memory_stats(self): + allocated_bytes = torch.cuda.memory_allocated() + reserved_bytes = torch.cuda.memory_reserved() + return allocated_bytes, reserved_bytes + + def bls_tensor_lifecycle_helper(self): + model_name = "dlpack_identity" + verbose = True # A 10 MB tensor. input_size = 10 * 1024 * 1024 + input_type_size_bytes = 4 # TYPE_FP32 + input_size_bytes = input_size * input_type_size_bytes # Sending the tensor 50 times to test whether the deallocation is # happening correctly. If the deallocation doesn't happen correctly, # there will be an out of shared memory error. for _ in range(50): input0 = np.ones([1, input_size], dtype=np.float32) - input0_pb = pb_utils.Tensor('INPUT0', input0) + input0_pb = pb_utils.Tensor("INPUT0", input0) infer_request = pb_utils.InferenceRequest( model_name=model_name, inputs=[input0_pb], - requested_output_names=['OUTPUT0']) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT0"], + ) + + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') - np.testing.assert_equal(output0.as_numpy(), input0, - "BLS CPU memory lifecycle failed.") + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + np.testing.assert_equal( + output0.as_numpy(), input0, "BLS CPU memory lifecycle failed." + ) + + # Show total memory stats before gpu tensor test + print(torch.cuda.memory_summary()) # Checking the same with the GPU tensors. for index in range(50): input0 = None infer_request = None input0_pb = None + fail_msg = f"GPU memory lifecycle test failed at index: {index}" torch.cuda.empty_cache() - free_memory, _ = torch.cuda.mem_get_info() - if index == 1: - recorded_memory = free_memory - - if index > 1: - self.assertEqual(free_memory, recorded_memory, - "GPU memory lifecycle test failed.") + alloced, cached = self.cuda_memory_stats() + + # Check cuda memory usage is cleaned up (empty) between iterations + # when device tensors go out of scope + self.assertEqual(alloced, 0, fail_msg) + # Check that cache is properly cleaned up when emptied + self.assertEqual(cached, 0, fail_msg) + + if verbose: + # NOTE: this reflects total gpu memory usage, and may be affected + # by other processes, so don't use it for direct checks but log it + # for debugging/context. 
+ free_memory, total_memory = torch.cuda.mem_get_info() + used_memory = total_memory - free_memory + print(f"[DEBUG][Iteration {index}][GPU] {used_memory=} bytes") + + input0 = torch.ones([1, input_size], dtype=torch.float32).to("cuda") + input0_pb = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(input0)) + # Check cuda memory usage after creating device tensor + alloced, _ = self.cuda_memory_stats() + self.assertEqual( + alloced, + input_size_bytes, + "Expected precise byte allocation after input tensor creation", + ) - input0 = torch.ones([1, input_size], dtype=torch.float32).to('cuda') - input0_pb = pb_utils.Tensor.from_dlpack('INPUT0', to_dlpack(input0)) infer_request = pb_utils.InferenceRequest( model_name=model_name, inputs=[input0_pb], - requested_output_names=['OUTPUT0']) - infer_response = infer_request.exec() + requested_output_names=["OUTPUT0"], + ) + + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() + self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") output0_pytorch = from_dlpack(output0.to_dlpack()) + # Stats after getting output tensor + alloced, _ = self.cuda_memory_stats() + self.assertEqual( + alloced, + input_size_bytes, + "Expected only input allocation, as output zero-copies input tensor", + ) + # Set inference response and output0_pytorch to None, to make sure # that the DLPack is still valid. output0 = None infer_response = None self.assertTrue( torch.all(output0_pytorch == input0), - f"input ({input0}) and output ({output0_pytorch}) didn't match for identity model." 
+ f"input ({input0}) and output ({output0_pytorch}) didn't match for identity model.", ) - def _test_gpu_bls_add_sub(self, is_input0_gpu, is_input1_gpu): + print(torch.cuda.memory_summary()) + + def assert_cuda_memory_empty(self, msg): + torch.cuda.empty_cache() + alloced, cached = self.cuda_memory_stats() + self.assertEqual(alloced, 0, msg) + self.assertEqual(cached, 0, msg) + + def test_bls_tensor_lifecycle(self): + self.assert_cuda_memory_empty("Expected all gpu memory cleaned up before test") + self.bls_tensor_lifecycle_helper() + self.assert_cuda_memory_empty("Expected all gpu memory cleaned up after test") + + def _test_gpu_bls_add_sub(self, is_input0_gpu, is_input1_gpu, is_decoupled=False): input0 = torch.rand(16) input1 = torch.rand(16) if is_input0_gpu: - input0 = input0.to('cuda') + input0 = input0.to("cuda") if is_input1_gpu: - input1 = input1.to('cuda') + input1 = input1.to("cuda") + + input0_pb = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(input0)) + input1_pb = pb_utils.Tensor.from_dlpack("INPUT1", to_dlpack(input1)) - input0_pb = pb_utils.Tensor.from_dlpack('INPUT0', to_dlpack(input0)) - input1_pb = pb_utils.Tensor.from_dlpack('INPUT1', to_dlpack(input1)) output0_dlpack, output1_dlpack = self._get_gpu_bls_outputs( - input0_pb, input1_pb) + input0_pb, input1_pb, is_decoupled=is_decoupled + ) - expected_output_0 = from_dlpack( - input0_pb.to_dlpack()).to('cpu') + from_dlpack( - input1_pb.to_dlpack()).to('cpu') - expected_output_1 = from_dlpack( - input0_pb.to_dlpack()).to('cpu') - from_dlpack( - input1_pb.to_dlpack()).to('cpu') + expected_output_0 = from_dlpack(input0_pb.to_dlpack()).to("cpu") + from_dlpack( + input1_pb.to_dlpack() + ).to("cpu") + expected_output_1 = from_dlpack(input0_pb.to_dlpack()).to("cpu") - from_dlpack( + input1_pb.to_dlpack() + ).to("cpu") self.assertTrue( - torch.all( - expected_output_0 == from_dlpack(output0_dlpack).to('cpu'))) + torch.all(expected_output_0 == from_dlpack(output0_dlpack).to("cpu")) + ) self.assertTrue( - torch.all( - expected_output_1 == from_dlpack(output1_dlpack).to('cpu'))) + torch.all(expected_output_1 == from_dlpack(output1_dlpack).to("cpu")) + ) def test_gpu_bls(self): for input0_device in [True, False]: for input1_device in [True, False]: - self._test_gpu_bls_add_sub(input0_device, input1_device) + self._test_gpu_bls_add_sub( + input0_device, input1_device, self._is_decoupled + ) def test_multiprocess(self): # Test multiprocess Pool with sync BLS - pool = Pool(10) - pool.map(bls_add_sub, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) - pool.close() - pool.join() + if self._is_decoupled: + # Fixme: DLIS-4630 + # func_name = bls_square + pass + else: + func_name = bls_add_sub + + pool = Pool(10) + pool.map(func_name, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + pool.close() + pool.join() def test_bls_sync(self): infer_request = pb_utils.InferenceRequest( - model_name='non_existent_model', - inputs=[], - requested_output_names=[]) - infer_response = infer_request.exec() - - # Because the model doesn't exist, the inference response must have an - # error - self.assertTrue(infer_response.has_error()) - self.assertEqual( - infer_response.error().message(), - "Failed for execute the inference request. Model 'non_existent_model' is not ready." + model_name="non_existent_model", inputs=[], requested_output_names=[] ) - # Make sure that the inference requests can be performed properly after - # an error. 
- self.assertTrue(bls_add_sub()) + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + + for infer_response in infer_responses: + # Because the model doesn't exist, the inference response must have an + # error + self.assertTrue(infer_response.has_error()) + self.assertIn( + "Failed for execute the inference request. Model 'non_existent_model' is not ready.", + infer_response.error().message(), + ) + + # Make sure that the inference requests can be performed properly after + # an error. + self.assertTrue(bls_square()) + else: + infer_response = infer_request.exec() + + # Because the model doesn't exist, the inference response must have an + # error + self.assertTrue(infer_response.has_error()) + self.assertIn( + "Failed for execute the inference request. Model 'non_existent_model' is not ready.", + infer_response.error().message(), + ) + + # Make sure that the inference requests can be performed properly after + # an error. + self.assertTrue(bls_add_sub()) def test_bls_execute_error(self): # Test BLS with a model that has an error during execution. - infer_request = pb_utils.InferenceRequest(model_name='execute_error', - inputs=[], - requested_output_names=[]) - infer_response = infer_request.exec() + infer_request = pb_utils.InferenceRequest( + model_name="execute_error", inputs=[], requested_output_names=[] + ) + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() + self.assertTrue(infer_response.has_error()) - self.assertEqual( + self.assertIn( + "expected 1 inputs but got 0 inputs for model 'execute_error'", infer_response.error().message(), - "expected 1 inputs but got 0 inputs for model 'execute_error'") + ) self.assertTrue(len(infer_response.output_tensors()) == 0) def test_multiple_bls(self): # Test running multiple BLS requests together - for _ in range(100): - self.assertTrue(bls_add_sub()) + if self._is_decoupled: + for _ in range(100): + self.assertTrue(bls_square()) + else: + for _ in range(100): + self.assertTrue(bls_add_sub()) + def test_timeout(self): + tensor_size = [1, 1024 * 1024] + input0_np = np.random.randn(*tensor_size) + input0 = pb_utils.Tensor("INPUT0", input0_np.astype(np.float32)) + infer_request = pb_utils.InferenceRequest( + model_name="identity_fp32_timeout", + inputs=[input0], + requested_output_names=["OUTPUT0"], + timeout=5, + ) -class TritonPythonModel: + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + else: + infer_response = infer_request.exec() + # Expect timeout error + self.assertTrue(infer_response.has_error()) + self.assertIn("Request timeout expired", infer_response.error().message()) + self.assertTrue(len(infer_response.output_tensors()) == 0) + + # Verifies two things: + # 1. A request timeout can be accessed by receiver models + # 2. A user can specify a very large value (11s) for a timeout + infer_request = pb_utils.InferenceRequest( + model_name="identity_fp32_timeout", + inputs=[input0], + requested_output_names=["OUTPUT0"], + timeout=11000000000, + ) + + if self._is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + else: + infer_response = infer_request.exec() + + # Expect no timeout error. 
Check for log message + # in test.sh + self.assertFalse(infer_response.has_error()) + + def _test_response_iterator_square( + self, expected_output_cnt, expected_output_value, response_iterator + ): + response_count = 0 + expected_output_cnt = np.array([expected_output_cnt], dtype=np.int32) + + for infer_response in response_iterator: + self.assertFalse(infer_response.has_error()) + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(expected_output_value, output0.as_numpy()) + + response_count += 1 + + self.assertEqual(response_count, expected_output_cnt) + + # Make sure the iterator is exhausted. + with self.assertRaises(StopIteration): + next(response_iterator) + + return response_iterator + + def test_response_iterator(self): + if self._is_decoupled: + # Test the response iterator for decoupled responses. The request + # has 4 decoupled responses followed by an empty response. + response_value = 4 + input0_np = np.array([response_value], dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", + inputs=[input0], + requested_output_names=["OUT"], + ) + infer_responses = infer_request.exec(decoupled=True) + + # case 1. Use Next() to get the next response first, then use + # for-loop to get the remaining responses. + infer_response = next(infer_responses) + self.assertFalse(infer_response.has_error()) + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(response_value, output0.as_numpy()) + # The iterator now should only have 4 remaining responses. + infer_responses = self._test_response_iterator_square( + 4, response_value, infer_responses + ) + + # case 2. Call for-loop to get all the responses multiple times. + infer_responses = self._test_response_iterator_square( + 5, response_value, infer_responses + ) + infer_responses = self._test_response_iterator_square( + 5, response_value, infer_responses + ) + infer_responses = self._test_response_iterator_square( + 5, response_value, infer_responses + ) + + # case 3. Break from the iteration, then use Next() and for-loop to + # get the remaining responses. + response_count = 0 + for infer_response in infer_responses: + self.assertFalse(infer_response.has_error()) + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(response_value, output0.as_numpy()) + + response_count += 1 + if response_count == 2: + break + + infer_response = next(infer_responses) + self.assertFalse(infer_response.has_error()) + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(response_value, output0.as_numpy()) + + # The iterator now should only have 2 remaining responses. + infer_responses = self._test_response_iterator_square( + 2, response_value, infer_responses + ) + + # case 4. Delete the iterator before all the responses have been + # retrieved. 
+ infer_responses = infer_request.exec(decoupled=True) + + infer_response = next(infer_responses) + self.assertFalse(infer_response.has_error()) + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + self.assertEqual(response_value, output0.as_numpy()) + + del infer_responses + + def test_preferred_memory(self): + self.assertTrue(bls_libtorch("libtorch_gpu", "CPU")) + self.assertTrue(bls_libtorch("libtorch_cpu", "GPU")) + + +class TritonPythonModel: def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. - test = unittest.main('model', exit=False) + test = unittest.main("model", exit=False) + for test_case, traceback in test.result.failures: + print(f"{test_case} failed:\n{traceback}") responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor( - 'OUTPUT0', - np.array([test.result.wasSuccessful()], - dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) return responses diff --git a/qa/python_models/bls_async/model.py b/qa/python_models/bls_async/model.py index 676b7727de..8d75259b7b 100644 --- a/qa/python_models/bls_async/model.py +++ b/qa/python_models/bls_async/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,44 +24,43 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+import asyncio +import os + import numpy as np -import triton_python_backend_utils as pb_utils import torch +import triton_python_backend_utils as pb_utils from torch.utils.dlpack import from_dlpack, to_dlpack -import asyncio def verify_add_sub_results(input0, input1, infer_response): if infer_response.has_error(): + print("Async BLS failed:", infer_response.error().message(), flush=True) return False - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') - output1 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT1') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") if (output0 is None) or (output1 is None): return False if not input0.is_cpu(): - input0 = from_dlpack( - input0.to_dlpack()).to('cpu').cpu().detach().numpy() + input0 = from_dlpack(input0.to_dlpack()).to("cpu").cpu().detach().numpy() else: input0 = input0.as_numpy() if not input1.is_cpu(): - input1 = from_dlpack( - input1.to_dlpack()).to('cpu').cpu().detach().numpy() + input1 = from_dlpack(input1.to_dlpack()).to("cpu").cpu().detach().numpy() else: input1 = input1.as_numpy() if not output0.is_cpu(): - output0 = from_dlpack( - output0.to_dlpack()).to('cpu').cpu().detach().numpy() + output0 = from_dlpack(output0.to_dlpack()).to("cpu").cpu().detach().numpy() else: output0 = output0.as_numpy() if not output1.is_cpu(): - output1 = from_dlpack( - output1.to_dlpack()).to('cpu').cpu().detach().numpy() + output1 = from_dlpack(output1.to_dlpack()).to("cpu").cpu().detach().numpy() else: output1 = output1.as_numpy() @@ -69,11 +68,56 @@ def verify_add_sub_results(input0, input1, infer_response): expected_output_1 = input0 - input1 if not np.all(expected_output_0 == output0): - print(f'For OUTPUT0 expected {expected_output_0} found {output0}') + print(f"For OUTPUT0 expected {expected_output_0} found {output0}") return False if not np.all(expected_output_1 == output1): - print(f'For OUTPUT1 expected {expected_output_1} found {output1}') + print(f"For OUTPUT1 expected {expected_output_1} found {output1}") + return False + + return True + + +def verify_square_results(input0, infer_responses): + if not input0.is_cpu(): + input0 = from_dlpack(input0.to_dlpack()).to("cpu").cpu().detach().numpy() + else: + input0 = input0.as_numpy() + + response_count = 0 + + for infer_response in infer_responses: + if infer_response.has_error(): + print( + "Async BLS decoupled failed:", + infer_response.error().message(), + flush=True, + ) + return False + + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + + if output0 is None: + return False + + if not output0.is_cpu(): + output0 = ( + from_dlpack(output0.to_dlpack()).to("cpu").cpu().detach().numpy() + ) + else: + output0 = output0.as_numpy() + + expected_output = input0 + + if not np.all(expected_output == input0): + print(f"For OUT expected {expected_output} found {output0}") + return False + + response_count += 1 + + if not np.all(input0 == response_count - 1): + print("Expected {} responses, got {}".format(input0, response_count - 1)) return False return True @@ -85,23 +129,36 @@ def create_addsub_inference_request(gpu=False): input1_np = np.random.randn(16) input0_np = input0_np.astype(np.float32) input1_np = input1_np.astype(np.float32) - input0 = pb_utils.Tensor('INPUT0', input0_np) - input1 = pb_utils.Tensor('INPUT1', input1_np) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", 
input1_np) else: - input0_pytorch = torch.rand(16).to('cuda') - input1_pytorch = torch.rand(16).to('cuda') - input0 = pb_utils.Tensor.from_dlpack('INPUT0', - to_dlpack(input0_pytorch)) - input1 = pb_utils.Tensor.from_dlpack('INPUT1', - to_dlpack(input1_pytorch)) + input0_pytorch = torch.rand(16).to("cuda") + input1_pytorch = torch.rand(16).to("cuda") + input0 = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(input0_pytorch)) + input1 = pb_utils.Tensor.from_dlpack("INPUT1", to_dlpack(input1_pytorch)) infer_request = pb_utils.InferenceRequest( - model_name='dlpack_add_sub', + model_name="dlpack_add_sub", inputs=[input0, input1], - requested_output_names=['OUTPUT0', 'OUTPUT1']) + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) return input0, input1, infer_request +def create_square_inference_request(gpu=False): + if not gpu: + input0_np = np.random.randint(16, size=1, dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + else: + input0_pytorch = torch.randint(1, 16, (1,), dtype=torch.int32).to("cuda") + input0 = pb_utils.Tensor.from_dlpack("IN", to_dlpack(input0_pytorch)) + + infer_request = pb_utils.InferenceRequest( + model_name="dlpack_square", inputs=[input0], requested_output_names=["OUT"] + ) + return input0, infer_request + + async def async_bls_add_sub(): input0, input1, infer_request = create_addsub_inference_request() infer_response = await infer_request.async_exec() @@ -117,7 +174,22 @@ async def async_bls_add_sub(): return True -async def multiple_async_bls(gpu): +async def async_bls_square(): + input0, infer_request = create_square_inference_request() + infer_responses = await infer_request.async_exec(decoupled=True) + result_correct = verify_square_results(input0, infer_responses) + if not result_correct: + return False + + infer_responses_sync = infer_request.exec(decoupled=True) + result_correct = verify_square_results(input0, infer_responses_sync) + if not result_correct: + return False + + return True + + +async def multiple_async_bls_addsub(gpu): infer_request_aws = [] inputs = [] for _ in range(10): @@ -127,14 +199,26 @@ async def multiple_async_bls(gpu): infer_responses = await asyncio.gather(*infer_request_aws) for infer_response, input_pair in zip(infer_responses, inputs): - if infer_response.has_error(): - print('Async BLS failed:', - infer_response.error().message(), - flush=True) + result_correct = verify_add_sub_results( + input_pair[0], input_pair[1], infer_response + ) + if not result_correct: return False - result_correct = verify_add_sub_results(input_pair[0], input_pair[1], - infer_response) + return True + + +async def multiple_async_bls_square(gpu): + infer_request_aws = [] + inputs = [] + for _ in range(10): + input0, infer_request = create_square_inference_request(gpu) + inputs.append(input0) + infer_request_aws.append(infer_request.async_exec(decoupled=True)) + + async_responses = await asyncio.gather(*infer_request_aws) + for infer_responses, input_pair in zip(async_responses, inputs): + result_correct = verify_square_results(input_pair, infer_responses) if not result_correct: return False @@ -142,18 +226,26 @@ async def multiple_async_bls(gpu): class TritonPythonModel: - async def execute(self, requests): + is_decoupled = True if os.environ["BLS_KIND"] == "decoupled" else False + responses = [] for _ in requests: - test1 = await multiple_async_bls(gpu=True) - test2 = await multiple_async_bls(gpu=False) - test3 = await async_bls_add_sub() + if is_decoupled: + test1 = await multiple_async_bls_square(gpu=True) + test2 = await 
multiple_async_bls_square(gpu=False) + test3 = await async_bls_square() + else: + test1 = await multiple_async_bls_addsub(gpu=True) + test2 = await multiple_async_bls_addsub(gpu=False) + test3 = await async_bls_add_sub() responses.append( - pb_utils.InferenceResponse(output_tensors=[ - pb_utils.Tensor('OUTPUT0', np.array([test1 & test2 & - test3])) - ])) + pb_utils.InferenceResponse( + output_tensors=[ + pb_utils.Tensor("OUTPUT0", np.array([test1 & test2 & test3])) + ] + ) + ) return responses diff --git a/qa/python_models/bls_finalize_error/config.pbtxt b/qa/python_models/bls_finalize_error/config.pbtxt new file mode 100644 index 0000000000..ff5f42188b --- /dev/null +++ b/qa/python_models/bls_finalize_error/config.pbtxt @@ -0,0 +1,38 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_finalize_error" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/bls_finalize_error/model.py b/qa/python_models/bls_finalize_error/model.py new file mode 100644 index 0000000000..a38b1080ad --- /dev/null +++ b/qa/python_models/bls_finalize_error/model.py @@ -0,0 +1,45 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + pass + + def execute(self, requests): + pass + + def finalize(self): + print("Cleaning up...") + input0_np = np.random.randint(3, size=1, dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", inputs=[input0], requested_output_names=["OUT"] + ) + infer_responses = infer_request.exec(decoupled=True) diff --git a/qa/python_models/bls_init_error/config.pbtxt b/qa/python_models/bls_init_error/config.pbtxt new file mode 100644 index 0000000000..6cf5024e1f --- /dev/null +++ b/qa/python_models/bls_init_error/config.pbtxt @@ -0,0 +1,38 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_init_error" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/bls_init_error/model.py b/qa/python_models/bls_init_error/model.py new file mode 100644 index 0000000000..b2518e0334 --- /dev/null +++ b/qa/python_models/bls_init_error/model.py @@ -0,0 +1,44 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + input0_np = np.random.randint(3, size=1, dtype=np.int32) + input0 = pb_utils.Tensor("IN", input0_np) + infer_request = pb_utils.InferenceRequest( + model_name="square_int32", inputs=[input0], requested_output_names=["OUT"] + ) + infer_responses = infer_request.exec(decoupled=True) + + def execute(self, requests): + pass + + def finalize(self): + print("Cleaning up...") diff --git a/qa/python_models/bls_memory/model.py b/qa/python_models/bls_memory/model.py index 101c321ec8..69da4f440f 100644 --- a/qa/python_models/bls_memory/model.py +++ b/qa/python_models/bls_memory/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,64 +24,80 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
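Note: the bls_init_error and bls_finalize_error models above deliberately issue a decoupled BLS call from initialize()/finalize() so the test can exercise that failure path; BLS requests are ordinarily made from execute(). For orientation, a minimal sketch of that ordinary pattern, reusing the square_int32 model name from the tests above (illustrative only, not part of the patch):

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for _ in requests:
            input0 = pb_utils.Tensor("IN", np.array([3], dtype=np.int32))
            infer_request = pb_utils.InferenceRequest(
                model_name="square_int32",
                inputs=[input0],
                requested_output_names=["OUT"],
            )
            # exec(decoupled=True) yields an iterator of responses; responses
            # that carry no output tensors are skipped, mirroring the
            # output_tensors() check used in the tests above.
            outputs = []
            for infer_response in infer_request.exec(decoupled=True):
                if infer_response.has_error():
                    raise pb_utils.TritonModelException(
                        infer_response.error().message()
                    )
                if len(infer_response.output_tensors()) > 0:
                    out = pb_utils.get_output_tensor_by_name(infer_response, "OUT")
                    outputs.append(out.as_numpy()[0])
            responses.append(
                pb_utils.InferenceResponse(
                    [pb_utils.Tensor("OUTPUT0", np.array(outputs, dtype=np.int32))]
                )
            )
        return responses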
-import numpy as np +import os import unittest + +import numpy as np import triton_python_backend_utils as pb_utils class PBBLSMemoryTest(unittest.TestCase): + def setUp(self): + self._is_decoupled = True if os.environ["BLS_KIND"] == "decoupled" else False - def _send_identity_tensor(self, size): + def _send_identity_tensor(self, size, is_decoupled): tensor_size = [1, size] input0_np = np.random.randn(*tensor_size) - input0 = pb_utils.Tensor('INPUT0', input0_np.astype(np.float32)) + input0 = pb_utils.Tensor("INPUT0", input0_np.astype(np.float32)) infer_request = pb_utils.InferenceRequest( - model_name='identity_fp32', + model_name="identity_fp32", inputs=[input0], - requested_output_names=['OUTPUT0']) - return input0_np, infer_request.exec() + requested_output_names=["OUTPUT0"], + ) + + if is_decoupled: + infer_responses = infer_request.exec(decoupled=True) + infer_response = next(infer_responses) + with self.assertRaises(StopIteration): + next(infer_responses) + else: + infer_response = infer_request.exec() + + return input0_np, infer_response def test_bls_out_of_memory(self): - tensor_size = 1024 * 1024 * 1024 - input0_np, infer_response = self._send_identity_tensor(tensor_size) + tensor_size = 256 * 1024 * 1024 + input0_np, infer_response = self._send_identity_tensor( + tensor_size, self._is_decoupled + ) out_of_memory_message = "Failed to increase the shared memory pool size for key" if infer_response.has_error(): - self.assertIn(out_of_memory_message, - infer_response.error().message()) + self.assertIn(out_of_memory_message, infer_response.error().message()) else: self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") self.assertIsNotNone(output0) self.assertTrue(np.allclose(output0.as_numpy(), input0_np)) tensor_size = 50 * 1024 * 1024 for _ in range(4): - input0_np, infer_response = self._send_identity_tensor(tensor_size) + input0_np, infer_response = self._send_identity_tensor( + tensor_size, self._is_decoupled + ) if infer_response.has_error(): - self.assertIn(out_of_memory_message, - infer_response.error().message()) + self.assertIn(out_of_memory_message, infer_response.error().message()) else: self.assertFalse(infer_response.has_error()) - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") self.assertIsNotNone(output0) self.assertTrue(np.allclose(output0.as_numpy(), input0_np)) class TritonPythonModel: - def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. - test = unittest.main('model', exit=False) + test = unittest.main("model", exit=False) responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor( - 'OUTPUT0', - np.array([test.result.wasSuccessful()], - dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) return responses diff --git a/qa/python_models/bls_memory_async/model.py b/qa/python_models/bls_memory_async/model.py index c7eec807b1..d9e676b42e 100644 --- a/qa/python_models/bls_memory_async/model.py +++ b/qa/python_models/bls_memory_async/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,31 +24,42 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +import os + import numpy as np import triton_python_backend_utils as pb_utils -async def _send_identity_tensor(size): +async def _send_identity_tensor(size, is_decoupled): tensor_size = [1, size] input0_np = np.random.randn(*tensor_size) - input0 = pb_utils.Tensor('INPUT0', input0_np.astype(np.float32)) + input0 = pb_utils.Tensor("INPUT0", input0_np.astype(np.float32)) infer_request = pb_utils.InferenceRequest( - model_name='identity_fp32', - inputs=[input0], - requested_output_names=['OUTPUT0']) - return input0_np, await infer_request.async_exec() + model_name="identity_fp32", inputs=[input0], requested_output_names=["OUTPUT0"] + ) + + if is_decoupled: + infer_responses = await infer_request.async_exec(decoupled=True) + infer_response = next(infer_responses) + else: + infer_response = await infer_request.async_exec() + + return input0_np, infer_response async def test_bls_out_of_memory(): - tensor_size = 1024 * 1024 * 1024 - input0_np, infer_response = await _send_identity_tensor(tensor_size) + is_decoupled = True if os.environ["BLS_KIND"] == "decoupled" else False + + tensor_size = 256 * 1024 * 1024 + input0_np, infer_response = await _send_identity_tensor(tensor_size, is_decoupled) + out_of_memory_message = "Failed to increase the shared memory pool size for key" if infer_response.has_error(): if not (out_of_memory_message in infer_response.error().message()): return False else: - output0 = pb_utils.get_output_tensor_by_name(infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") if output0 is None: return False if not np.allclose(output0.as_numpy(), input0_np): @@ -56,13 +67,15 @@ async def test_bls_out_of_memory(): tensor_size = 50 * 1024 * 1024 for _ in range(4): - input0_np, infer_response = await _send_identity_tensor(tensor_size) + input0_np, infer_response = await _send_identity_tensor( + tensor_size, is_decoupled + ) + if infer_response.has_error(): if not (out_of_memory_message in infer_response.error().message()): return False else: - output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") if output0 is None: return False if not np.allclose(output0.as_numpy(), input0_np): @@ -72,15 +85,14 @@ async def test_bls_out_of_memory(): class TritonPythonModel: - async def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. result = await test_bls_out_of_memory() responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor('OUTPUT0', - np.array([result], dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [pb_utils.Tensor("OUTPUT0", np.array([result], dtype=np.float16))] + ) + ) return responses diff --git a/qa/python_models/bls_model_loading/config.pbtxt b/qa/python_models/bls_model_loading/config.pbtxt new file mode 100644 index 0000000000..2099ba5db7 --- /dev/null +++ b/qa/python_models/bls_model_loading/config.pbtxt @@ -0,0 +1,43 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_model_loading" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_BOOL + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] diff --git a/qa/python_models/bls_model_loading/model.py b/qa/python_models/bls_model_loading/model.py new file mode 100644 index 0000000000..84162e2fac --- /dev/null +++ b/qa/python_models/bls_model_loading/model.py @@ -0,0 +1,135 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
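The bls_model_loading test that follows drives the in-process model control API. As a quick orientation, a condensed sketch of the load/unload round-trip it relies on (assuming, as in this suite, that the server runs with explicit model control):

import triton_python_backend_utils as pb_utils

model_name = "onnx_int32_int32_int32"

# Load and verify readiness.
pb_utils.load_model(model_name=model_name)
assert pb_utils.is_model_ready(model_name)

# Reload with an inline config override that pins version 2 only.
config = '{"backend":"onnxruntime", "version_policy":{"specific":{"versions":[2]}}}'
pb_utils.load_model(model_name, config=config)
assert pb_utils.is_model_ready(model_name, "2")
assert not pb_utils.is_model_ready(model_name, "3")

# unload_model() returns before the unload completes (see the tearDown below).
pb_utils.unload_model(model_name)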
+ +import time +import unittest + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class PBBLSModelLoadingTest(unittest.TestCase): + def setUp(self): + self.model_name = "onnx_int32_int32_int32" + + def tearDown(self): + # The unload call does not wait for the requested model to be fully + # unloaded before returning. + pb_utils.unload_model(self.model_name) + # TODO: Make this more robust to wait until fully unloaded + print("Sleep 30 seconds to make sure model finishes unloading...") + time.sleep(30) + print("Done sleeping.") + + def test_load_unload_model(self): + self.assertFalse(pb_utils.is_model_ready(model_name=self.model_name)) + pb_utils.load_model(model_name=self.model_name) + self.assertTrue(pb_utils.is_model_ready(self.model_name)) + pb_utils.unload_model(self.model_name) + self.assertFalse(pb_utils.is_model_ready(self.model_name)) + + def test_load_with_config_override(self): + self.assertFalse(pb_utils.is_model_ready(self.model_name)) + pb_utils.load_model(self.model_name) + self.assertTrue(pb_utils.is_model_ready(self.model_name)) + + # Send the config with the wrong format + wrong_config = '"parameters": {"config": {{"backend":"onnxruntime", "version_policy":{"specific":{"versions":[2]}}}}}' + with self.assertRaises(pb_utils.TritonModelException): + pb_utils.load_model(model_name=self.model_name, config=wrong_config) + # The model should not be changed after a failed load model request + for version in ["2", "3"]: + self.assertTrue( + pb_utils.is_model_ready( + model_name=self.model_name, model_version=version + ) + ) + + # Send the config with the correct format + config = ( + '{"backend":"onnxruntime", "version_policy":{"specific":{"versions":[2]}}}' + ) + pb_utils.load_model(self.model_name, config=config) + # The model should be changed after a successful load model request + self.assertTrue(pb_utils.is_model_ready(self.model_name, "2")) + self.assertFalse(pb_utils.is_model_ready(self.model_name, "3")) + + def test_load_with_file_override(self): + self.assertFalse(pb_utils.is_model_ready(self.model_name)) + pb_utils.load_model(self.model_name) + self.assertTrue(pb_utils.is_model_ready(self.model_name)) + + override_name = "override_model" + config = '{"backend":"onnxruntime"}' + with open("models/onnx_int32_int32_int32/3/model.onnx", "rb") as file: + data = file.read() + files = {"file:1/model.onnx": data} + + # Request to load the model with override file, should fail without + # providing override config. 
+ with self.assertRaises(pb_utils.TritonModelException): + pb_utils.load_model(self.model_name, "", files) + + # Request to load the model with override file and config in a different name + pb_utils.load_model(model_name=override_name, config=config, files=files) + # Sanity check that the model with original name is unchanged + self.assertFalse(pb_utils.is_model_ready(self.model_name, "1")) + self.assertTrue(pb_utils.is_model_ready(self.model_name, "3")) + + # Check the override model readiness + self.assertTrue(pb_utils.is_model_ready(override_name, "1")) + self.assertFalse(pb_utils.is_model_ready(override_name, "3")) + + # Request to load the model with override file and config in original name + pb_utils.load_model(self.model_name, config, files) + # Check that the model with original name is changed + self.assertTrue(pb_utils.is_model_ready(self.model_name, "1")) + self.assertFalse(pb_utils.is_model_ready(self.model_name, "3")) + + # Sanity check readiness of the different named model + self.assertTrue(pb_utils.is_model_ready(override_name, "1")) + self.assertFalse(pb_utils.is_model_ready(override_name, "3")) + + +class TritonPythonModel: + def initialize(self, args): + # Run the unittest during initialization + test = unittest.main("model", exit=False) + self.result = test.result.wasSuccessful() + + def execute(self, requests): + responses = [] + for _ in requests: + responses.append( + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", np.array([self.result], dtype=np.float16) + ) + ] + ) + ) + return responses diff --git a/qa/python_models/bls_onnx_warmup/config.pbtxt b/qa/python_models/bls_onnx_warmup/config.pbtxt new file mode 100644 index 0000000000..879f85ca81 --- /dev/null +++ b/qa/python_models/bls_onnx_warmup/config.pbtxt @@ -0,0 +1,38 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
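Several of the new QA models in this patch share the same harness: the real assertions live in a unittest.TestCase inside model.py, and the model simply runs the suite and reports success through a single output tensor. A condensed sketch of that harness (the class name ModelTest is illustrative):

import unittest

import numpy as np
import triton_python_backend_utils as pb_utils


class ModelTest(unittest.TestCase):
    def test_something(self):
        self.assertTrue(True)  # real checks go here


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for _ in requests:
            # "model" refers to this model.py module, as in the tests in this patch.
            test = unittest.main("model", exit=False)
            responses.append(
                pb_utils.InferenceResponse(
                    [
                        pb_utils.Tensor(
                            "OUTPUT0",
                            np.array([test.result.wasSuccessful()], dtype=np.float16),
                        )
                    ]
                )
            )
        return responses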
+ +name: "bls_onnx_warmup" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] \ No newline at end of file diff --git a/qa/python_models/bls_onnx_warmup/model.py b/qa/python_models/bls_onnx_warmup/model.py new file mode 100644 index 0000000000..233bdc85ab --- /dev/null +++ b/qa/python_models/bls_onnx_warmup/model.py @@ -0,0 +1,88 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import unittest + +import numpy as np +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack + + +class PBBLSONNXWarmupTest(unittest.TestCase): + def test_onnx_output_mem_type(self): + input0_np = np.random.randn(*[16]) + input0_np = input0_np.astype(np.float32) + input1_np = np.random.randn(*[16]) + input1_np = input1_np.astype(np.float32) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", input1_np) + infer_request = pb_utils.InferenceRequest( + model_name="onnx_nobatch_float32_float32_float32", + inputs=[input0, input1], + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) + + infer_response = infer_request.exec() + + self.assertFalse(infer_response.has_error()) + + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") + + self.assertIsNotNone(output0) + self.assertIsNotNone(output1) + + # The memory type of output tensor should be GPU + self.assertFalse(output0.is_cpu()) + self.assertFalse(output1.is_cpu()) + + expected_output_0 = input0.as_numpy() - input1.as_numpy() + expected_output_1 = input0.as_numpy() + input1.as_numpy() + + output0 = from_dlpack(output0.to_dlpack()).to("cpu").cpu().detach().numpy() + output1 = from_dlpack(output1.to_dlpack()).to("cpu").cpu().detach().numpy() + + self.assertTrue(np.all(output0 == expected_output_0)) + self.assertTrue(np.all(output1 == expected_output_1)) + + +class TritonPythonModel: + def execute(self, requests): + responses = [] + for _ in requests: + # Run the unittest and store the results in InferenceResponse. + test = unittest.main("model", exit=False) + responses.append( + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) + return responses diff --git a/qa/python_models/bls_parameters/config.pbtxt b/qa/python_models/bls_parameters/config.pbtxt new file mode 100644 index 0000000000..dddf300185 --- /dev/null +++ b/qa/python_models/bls_parameters/config.pbtxt @@ -0,0 +1,52 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_parameters" +backend: "python" +max_batch_size: 0 + +input [ + { + name: "NUMBER_PARAMETERS" + data_type: TYPE_UINT8 + dims: [ 1 ] + } +] + +output [ + { + name: "PARAMETERS_AGGREGATED" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +instance_group [ + { + count: 4 + kind: KIND_CPU + } +] diff --git a/qa/python_models/bls_parameters/model.py b/qa/python_models/bls_parameters/model.py new file mode 100644 index 0000000000..5dc54ebffd --- /dev/null +++ b/qa/python_models/bls_parameters/model.py @@ -0,0 +1,77 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
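The bls_parameters model below forwards custom request parameters through BLS: the incoming parameters are read as a JSON string via request.parameters(), extended, and handed to the nested pb_utils.InferenceRequest through its parameters argument. A minimal sketch of just that hand-off (the extra "str_1" entry is illustrative):

import json

import numpy as np
import triton_python_backend_utils as pb_utils


def forward_with_extra_parameter(request):
    # request.parameters() returns the request parameters as a JSON string.
    params = json.loads(request.parameters())
    params["str_1"] = "1"  # illustrative extra entry

    count = pb_utils.Tensor("NUMBER_PARAMETERS", np.array([0], dtype=np.ubyte))
    bls_request = pb_utils.InferenceRequest(
        model_name="bls_parameters",
        inputs=[count],
        requested_output_names=["PARAMETERS_AGGREGATED"],
        parameters=params,
    )
    return bls_request.exec()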
+ +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + responses = [] + + for request in requests: + num_params = int( + pb_utils.get_input_tensor_by_name( + request, "NUMBER_PARAMETERS" + ).as_numpy()[0] + ) + params = json.loads(request.parameters()) + + if num_params == 0: + # Base case where the received parameters are returned as JSON + response = json.dumps(params) + response_tensors = [ + pb_utils.Tensor( + "PARAMETERS_AGGREGATED", np.array([response], dtype=np.object_) + ) + ] + else: + # Add the parameters of num_params step to the received parameters + params["bool_" + str(num_params)] = bool(num_params) + params["int_" + str(num_params)] = num_params + params["str_" + str(num_params)] = str(num_params) + # Complete any remaining steps [1, num_params - 1] by calling self + # recursively via BLS + bls_request_tensor = pb_utils.Tensor( + "NUMBER_PARAMETERS", np.array([num_params - 1], dtype=np.ubyte) + ) + bls_request = pb_utils.InferenceRequest( + model_name="bls_parameters", + inputs=[bls_request_tensor], + requested_output_names=["PARAMETERS_AGGREGATED"], + parameters=params, + ) + bls_response = bls_request.exec() + response_tensors = bls_response.output_tensors() + + inference_response = pb_utils.InferenceResponse( + output_tensors=response_tensors + ) + responses.append(inference_response) + + return responses diff --git a/qa/python_models/bls_request_rescheduling/config.pbtxt b/qa/python_models/bls_request_rescheduling/config.pbtxt new file mode 100644 index 0000000000..84f8658f7f --- /dev/null +++ b/qa/python_models/bls_request_rescheduling/config.pbtxt @@ -0,0 +1,38 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
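Tracing the recursion in bls_parameters above: each step adds its own bool_/int_/str_ entries before recursing with NUMBER_PARAMETERS reduced by one, and the base case serializes the accumulated dict. For NUMBER_PARAMETERS=2 and no client-supplied parameters, PARAMETERS_AGGREGATED is therefore expected to decode to (key order aside):

expected = {
    "bool_2": True, "int_2": 2, "str_2": "2",
    "bool_1": True, "int_1": 1, "str_1": "1",
}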
+ +name: "bls_request_rescheduling" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/bls_request_rescheduling/model.py b/qa/python_models/bls_request_rescheduling/model.py new file mode 100644 index 0000000000..8615622af9 --- /dev/null +++ b/qa/python_models/bls_request_rescheduling/model.py @@ -0,0 +1,133 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +import time +import unittest + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class RequestReschedulingTest(unittest.TestCase): + def _reload_model(self, model_name): + # Reload the model to reset the flag for multiple iterations + pb_utils.unload_model(model_name) + # TODO: Make this more robust to wait until fully unloaded + print("Sleep 10 seconds to make sure model finishes unloading...", flush=True) + time.sleep(10) + print("Done sleeping.", flush=True) + pb_utils.load_model(model_name) + + def test_wrong_return_type(self): + input0 = pb_utils.Tensor("INPUT0", (np.random.randn(*[4])).astype(np.float32)) + infer_request = pb_utils.InferenceRequest( + model_name="wrong_return_type", + inputs=[input0], + requested_output_names=["OUTPUT0"], + ) + + infer_response = infer_request.exec() + self.assertTrue(infer_response.has_error()) + self.assertIn( + "Expected a None object in the execute function return list for reschduled request", + infer_response.error().message(), + ) + + def test_non_decoupled_e2e(self): + model_name = "request_rescheduling_addsub" + self._reload_model(model_name) + + input0_np = np.random.randn(*[16]) + input0_np = input0_np.astype(np.float32) + input1_np = np.random.randn(*[16]) + input1_np = input1_np.astype(np.float32) + input0 = pb_utils.Tensor("INPUT0", input0_np) + input1 = pb_utils.Tensor("INPUT1", input1_np) + infer_request = pb_utils.InferenceRequest( + model_name=model_name, + inputs=[input0, input1], + requested_output_names=["OUTPUT0", "OUTPUT1"], + ) + infer_response = infer_request.exec() + + self.assertFalse(infer_response.has_error()) + + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0") + output1 = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT1") + + self.assertIsNotNone(output0) + self.assertIsNotNone(output1) + + expected_output_0 = input0.as_numpy() + input1.as_numpy() + expected_output_1 = input0.as_numpy() - input1.as_numpy() + + self.assertEqual(expected_output_0[0], output0.as_numpy()[0]) + self.assertEqual(expected_output_1[0], output1.as_numpy()[0]) + + def test_decoupled_e2e(self): + model_name = "iterative_sequence" + self._reload_model(model_name) + + input_value = 3 + input0 = pb_utils.Tensor("IN", np.array([input_value], dtype=np.int32)) + infer_request = pb_utils.InferenceRequest( + model_name=model_name, + inputs=[input0], + requested_output_names=["OUT"], + ) + infer_responses = infer_request.exec(decoupled=True) + + expected_output = input_value - 1 + + if infer_responses: + for infer_response in infer_responses: + self.assertFalse(infer_response.has_error()) + + if len(infer_response.output_tensors()) > 0: + output0 = pb_utils.get_output_tensor_by_name(infer_response, "OUT") + self.assertIsNotNone(output0) + + self.assertEqual(expected_output, output0.as_numpy()[0]) + expected_output -= 1 + + +class TritonPythonModel: + def execute(self, requests): + responses = [] + for _ in requests: + # Run the unittest and store the results in InferenceResponse. + test = unittest.main("model", exit=False) + responses.append( + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) + return responses diff --git a/qa/python_models/bls_simple/bls_simple.py b/qa/python_models/bls_simple/bls_simple.py new file mode 100644 index 0000000000..962c3834b9 --- /dev/null +++ b/qa/python_models/bls_simple/bls_simple.py @@ -0,0 +1,84 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + inputs = [ + {"name": "MODEL_NAME", "data_type": "TYPE_STRING", "dims": [1]}, + {"name": "INPUT0", "data_type": "TYPE_INT32", "dims": [1, 16]}, + {"name": "INPUT1", "data_type": "TYPE_INT32", "dims": [1, 16]}, + ] + outputs = [ + {"name": "OUTPUT0", "data_type": "TYPE_INT32", "dims": [16]}, + {"name": "OUTPUT1", "data_type": "TYPE_INT32", "dims": [16]}, + ] + + config = auto_complete_model_config.as_dict() + input_names = [] + output_names = [] + for input in config["input"]: + input_names.append(input["name"]) + for output in config["output"]: + output_names.append(output["name"]) + + for input in inputs: + if input["name"] not in input_names: + auto_complete_model_config.add_input(input) + for output in outputs: + if output["name"] not in output_names: + auto_complete_model_config.add_output(output) + + auto_complete_model_config.set_max_batch_size(0) + + return auto_complete_model_config + + def execute(self, requests): + responses = [] + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + model_name = pb_utils.get_input_tensor_by_name(request, "MODEL_NAME") + model_name_string = model_name.as_numpy()[0] + + infer_request = pb_utils.InferenceRequest( + model_name=model_name_string, + requested_output_names=["OUTPUT0", "OUTPUT1"], + inputs=[in_0, in_1], + trace=request.trace(), + ) + + infer_response = infer_request.exec() + + inference_response = pb_utils.InferenceResponse( + output_tensors=infer_response.output_tensors() + ) + responses.append(inference_response) + + return responses diff --git a/qa/python_models/bls_undefined/config.pbtxt b/qa/python_models/bls_undefined/config.pbtxt new file mode 100644 index 0000000000..ab873d8a64 --- /dev/null +++ b/qa/python_models/bls_undefined/config.pbtxt @@ -0,0 +1,50 @@ +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "bls_undefined" +backend: "python" + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ -1 ] + } +] + +instance_group [{ + kind: KIND_CPU, + count: 2 +}] + diff --git a/qa/python_models/bls_undefined/model.py b/qa/python_models/bls_undefined/model.py new file mode 100644 index 0000000000..30e5f4106a --- /dev/null +++ b/qa/python_models/bls_undefined/model.py @@ -0,0 +1,33 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
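As a usage illustration, bls_simple above can be driven from a client by supplying the target model's name as a string tensor alongside the two INT32 inputs. A sketch with the Triton HTTP client on the default port; the client calls are standard tritonclient usage, while the target model name and input values are illustrative and not taken from the test suite:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

inputs = [
    httpclient.InferInput("MODEL_NAME", [1], "BYTES"),
    httpclient.InferInput("INPUT0", [1, 16], "INT32"),
    httpclient.InferInput("INPUT1", [1, 16], "INT32"),
]
# Any add/sub model with two INT32 [1, 16] inputs works; the name is illustrative.
inputs[0].set_data_from_numpy(np.array(["onnx_int32_int32_int32"], dtype=np.object_))
inputs[1].set_data_from_numpy(np.ones((1, 16), dtype=np.int32))
inputs[2].set_data_from_numpy(np.arange(16, dtype=np.int32).reshape(1, 16))

result = client.infer("bls_simple", inputs)
print(result.as_numpy("OUTPUT0"), result.as_numpy("OUTPUT1"))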
+ + +class TritonPythonModel: + def execute(self, requests): + undefined_variable + + def finalize(self): + print("Cleaning up...") diff --git a/qa/python_models/cuda_memory_consumer/1/model.py b/qa/python_models/cuda_memory_consumer/1/model.py new file mode 100644 index 0000000000..e3526920ea --- /dev/null +++ b/qa/python_models/cuda_memory_consumer/1/model.py @@ -0,0 +1,69 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
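The cuda_memory_consumer model below uses the cuda-python bindings, whose calls return a tuple whose first element is the CUresult status; the model indexes into those tuples (mem_info[0], mem_info[2]). Spelled out, that convention looks roughly like this sketch:

from cuda import cuda

(err,) = cuda.cuInit(0)
err, ctx = cuda.cuCtxCreate(0, 0)

# cuMemGetInfo() -> (status, free_bytes, total_bytes)
err, free_bytes, total_bytes = cuda.cuMemGetInfo()
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("cuMemGetInfo failed")

# Allocate roughly 40% of total device memory, as the model below does.
err, dptr = cuda.cuMemAlloc(int(total_bytes * 0.4))
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError("cuMemAlloc failed")

cuda.cuMemFree(dptr)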
+ +import triton_python_backend_utils as pb_utils +from cuda import cuda + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + input = {"name": "INPUT", "data_type": "TYPE_FP32", "dims": [1]} + output = {"name": "OUTPUT", "data_type": "TYPE_FP32", "dims": [1]} + + auto_complete_model_config.set_max_batch_size(0) + auto_complete_model_config.add_input(input) + auto_complete_model_config.add_output(output) + + return auto_complete_model_config + + def initialize(self, args): + self.mem_ptr = None + # Initialize CUDA context + cuda.cuInit(0) + cuda.cuCtxCreate(0, 0) + + mem_info = cuda.cuMemGetInfo() + if mem_info[0] != 0: + raise pb_utils.TritonModelException("Failed to get CUDA memory info") + + mem_alloc = cuda.cuMemAlloc(mem_info[2] * 0.4) + if mem_alloc[0] != 0: + raise pb_utils.TritonModelException("Failed to allocate CUDA memory") + self.mem_ptr = mem_alloc[1] + + def finalize(self): + if self.mem_ptr is not None: + cuda.cuMemFree(self.mem_ptr) + + def execute(self, requests): + """This function is called on inference request.""" + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + out_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy()) + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/qa/python_models/cuda_memory_consumer/config.pbtxt b/qa/python_models/cuda_memory_consumer/config.pbtxt new file mode 100644 index 0000000000..b1e0348433 --- /dev/null +++ b/qa/python_models/cuda_memory_consumer/config.pbtxt @@ -0,0 +1,28 @@ +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +backend: "python" +instance_group [{ kind: KIND_GPU, gpus: [0] }] diff --git a/qa/python_models/custom_metrics/config.pbtxt b/qa/python_models/custom_metrics/config.pbtxt new file mode 100644 index 0000000000..c2bf81331b --- /dev/null +++ b/qa/python_models/custom_metrics/config.pbtxt @@ -0,0 +1,43 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "custom_metrics" +backend: "python" + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [ + { + count: 3 + kind: KIND_CPU + } +] diff --git a/qa/python_models/custom_metrics/model.py b/qa/python_models/custom_metrics/model.py new file mode 100644 index 0000000000..31f105a1dd --- /dev/null +++ b/qa/python_models/custom_metrics/model.py @@ -0,0 +1,278 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
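The custom_metrics test below exercises the Python backend metrics API end to end. As orientation, a condensed sketch of the basic flow it builds on (family creation, a labeled metric, and value updates); the metric name and labels are illustrative:

import triton_python_backend_utils as pb_utils

# A MetricFamily is created once per unique name; COUNTER and GAUGE kinds
# are supported.
family = pb_utils.MetricFamily(
    name="example_requests_total",
    description="example counter",
    kind=pb_utils.MetricFamily.COUNTER,
)

# Metrics are instantiated from the family with a label set.
metric = family.Metric(labels={"example1": "label1"})

metric.increment(1.0)  # counters only move up
print(metric.value())  # -> 1.0
# Gauges additionally support set(); on counters, set() and negative
# increments raise TritonModelException, as the test below verifies.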
+ +import unittest + +import numpy as np +import requests +import triton_python_backend_utils as pb_utils + + +class PBCustomMetricsTest(unittest.TestCase): + def _get_metrics(self): + metrics_url = "http://localhost:8002/metrics" + r = requests.get(metrics_url) + r.raise_for_status() + return r.text + + def _metric_api_helper(self, metric, kind): + # Adding logger to test if custom metrics and logging work together + # as they use the same message queue. + logger = pb_utils.Logger + + # The value should be 0.0 before the test + self.assertEqual(metric.value(), 0.0) + + # Test increment positive value + increment = 2023.0 + metric.increment(increment) + self.assertEqual(metric.value(), increment) + logger.log_info("Incremented metric to : {}".format(metric.value())) + + # Test increment negative value + decrement = -23.5 + if kind == "counter": + # Counter should not accept negative values + with self.assertRaises(pb_utils.TritonModelException): + metric.increment(decrement) + else: + metric.increment(decrement) + self.assertEqual(metric.value(), increment + decrement) + logger.log_info("Decremented metric to : {}".format(metric.value())) + + # Test set value + value = 999.9 + if kind == "counter": + # Counter does not support set + with self.assertRaises(pb_utils.TritonModelException): + metric.set(value) + else: + metric.set(value) + self.assertEqual(metric.value(), value) + logger.log_info("Set metric to : {}".format(metric.value())) + + def _dup_metric_helper(self, labels={}): + # Adding logger to test if custom metrics and logging work together + # as they use the same message queue. + logger = pb_utils.Logger + + description = "dup metric" + metric_family = pb_utils.MetricFamily( + name="test_dup_metric", + description=description, + kind=pb_utils.MetricFamily.COUNTER, + ) + + # Verify dupe metrics reference same underlying metric + metric1 = metric_family.Metric(labels=labels) + metric2 = metric_family.Metric(labels=labels) + + # The value should be 0 before the test + self.assertEqual(metric1.value(), 0.0) + self.assertEqual(metric2.value(), 0.0) + + # Increment metric 1, check metric 2 == metric 1 + increment = 7.5 + metric1.increment(increment) + self.assertEqual(metric1.value(), metric2.value()) + logger.log_info("Incremented metric1 to : {}".format(metric1.value())) + logger.log_info("Incremented metric2 to : {}".format(metric2.value())) + + # Assert custom metric/family remains when there's still a reference to it + del metric1 + metrics = self._get_metrics() + self.assertIn(description, metrics) + + def test_counter_e2e(self): + metric_family = pb_utils.MetricFamily( + name="test_counter_e2e", + description="test metric counter kind end to end", + kind=pb_utils.MetricFamily.COUNTER, + ) + labels = {"example1": "counter_label1", "example2": "counter_label2"} + metric = metric_family.Metric(labels=labels) + self._metric_api_helper(metric, "counter") + + pattern = ( + 'test_counter_e2e{example1="counter_label1",example2="counter_label2"}' + ) + metrics = self._get_metrics() + self.assertIn(pattern, metrics) + + def test_gauge_e2e(self): + metric_family = pb_utils.MetricFamily( + name="test_gauge_e2e", + description="test metric gauge kind end to end", + kind=pb_utils.MetricFamily.GAUGE, + ) + labels = {"example1": "counter_label1", "example2": "counter_label2"} + metric = metric_family.Metric(labels=labels) + self._metric_api_helper(metric, "gauge") + + pattern = 'test_gauge_e2e{example1="counter_label1",example2="counter_label2"}' + metrics = self._get_metrics() + 
self.assertIn(pattern, metrics) + + def test_dup_metric_family_diff_kind(self): + # Test that a duplicate metric family can't be added with a conflicting type/kind + metric_family1 = pb_utils.MetricFamily( + name="test_dup_metric_family_diff_kind", + description="test metric family with same name but different kind", + kind=pb_utils.MetricFamily.COUNTER, + ) + with self.assertRaises(pb_utils.TritonModelException): + metric_family2 = pb_utils.MetricFamily( + name="test_dup_metric_family_diff_kind", + description="test metric family with same name but different kind", + kind=pb_utils.MetricFamily.GAUGE, + ) + self.assertIsNone(metric_family2) + + self.assertIsNotNone(metric_family1) + + def test_dup_metric_family_diff_description(self): + # Test that a duplicate metric family name will still return the + # original metric family even if the description is changed + metric_family1 = pb_utils.MetricFamily( + name="test_dup_metric_family_diff_description", + description="first description", + kind=pb_utils.MetricFamily.COUNTER, + ) + metric_family2 = pb_utils.MetricFamily( + name="test_dup_metric_family_diff_description", + description="second description", + kind=pb_utils.MetricFamily.COUNTER, + ) + + metric2 = metric_family2.Metric() + self.assertEqual(metric2.value(), 0) + + # Delete metric_family1 and check if metric_family2 still references it + del metric_family1 + pattern = "test_dup_metric_family_diff_description first description" + metrics = self._get_metrics() + self.assertIn(pattern, metrics) + + # The first description will be kept if adding a duplicate metric + # family name with a different description + pattern = "test_dup_metric_family_diff_description second description" + self.assertNotIn(pattern, metrics) + + def test_dup_metric_family(self): + # Test that adding a duplicate metric family will reuse the original + # and not add another entry to registry + metric_family1 = pb_utils.MetricFamily( + name="test_dup_metric_family", + description="dup description", + kind=pb_utils.MetricFamily.COUNTER, + ) + metric_family2 = pb_utils.MetricFamily( + name="test_dup_metric_family", + description="dup description", + kind=pb_utils.MetricFamily.COUNTER, + ) + + metric_key = "custom_metric_key" + metric1 = metric_family1.Metric(labels={metric_key: "label1"}) + metric2 = metric_family2.Metric(labels={metric_key: "label2"}) + + self.assertEqual(metric1.value(), 0) + self.assertEqual(metric2.value(), 0) + + patterns = [ + "# HELP test_dup_metric_family dup description", + "# TYPE test_dup_metric_family counter", + 'test_dup_metric_family{custom_metric_key="label2"} 0', + 'test_dup_metric_family{custom_metric_key="label1"} 0', + ] + metrics = self._get_metrics() + for pattern in patterns: + self.assertIn(pattern, metrics) + + def test_dup_metric_labels(self): + # Test that adding a duplicate metric will refer to the same + # underlying metric, and all instances will be updated + labels = {"example1": "label1", "example2": "label2"} + self._dup_metric_helper(labels) + + def test_dup_metric_empty_labels(self): + # Test that adding a duplicate metric will refer to the same + # underlying metric, and all instances will be updated + self._dup_metric_helper() + + def test_metric_lifetime_error(self): + # Test the error handling when the corresponding 'MetricFamily' is + # deleted before the 'Metric' is deleted, and the 'Metric' is still + # being used for metric operations + kinds = [pb_utils.MetricFamily.COUNTER, pb_utils.MetricFamily.GAUGE] + metric_family_names = [ + 
"test_metric_lifetime_error_counter", + "test_metric_lifetime_error_gauge", + ] + for kind, name in zip(kinds, metric_family_names): + metric_family = pb_utils.MetricFamily( + name=name, description="test metric lifetime error", kind=kind + ) + labels = {"example1": "counter_label1", "example2": "counter_label2"} + metric = metric_family.Metric(labels=labels) + + # Intentionally delete the 'MetricFamily' before the 'Metric' being deleted + del metric_family + + error_msg = "Invalid metric operation as the corresponding 'MetricFamily' has been deleted." + + # Counter does not support set + if kind is not pb_utils.MetricFamily.COUNTER: + with self.assertRaises(pb_utils.TritonModelException) as ex: + metric.set(10) + self.assertIn(error_msg, str(ex.exception)) + + with self.assertRaises(pb_utils.TritonModelException) as ex: + metric.increment(10) + self.assertIn(error_msg, str(ex.exception)) + + with self.assertRaises(pb_utils.TritonModelException) as ex: + metric.value() + self.assertIn(error_msg, str(ex.exception)) + + +class TritonPythonModel: + def execute(self, requests): + responses = [] + for _ in requests: + # Run the unittest and store the results in InferenceResponse. + test = unittest.main("model", exit=False) + responses.append( + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) + return responses diff --git a/qa/python_models/delayed_model/model.py b/qa/python_models/delayed_model/model.py index 639497f542..e7538148f1 100644 --- a/qa/python_models/delayed_model/model.py +++ b/qa/python_models/delayed_model/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,21 +24,21 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils import time +import triton_python_backend_utils as pb_utils + # Sleep for 5 seconds to ensure that delayed startup works properly. time.sleep(5) class TritonPythonModel: - def execute(self, requests): responses = [] for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "IN") - out_tensor = utils.Tensor("OUT", input_tensor.as_numpy()) - responses.append(utils.InferenceResponse([out_tensor])) + out_tensor = pb_utils.Tensor("OUT", input_tensor.as_numpy()) + responses.append(pb_utils.InferenceResponse([out_tensor])) return responses def finalize(self): diff --git a/qa/python_models/dlpack_add_sub/model.py b/qa/python_models/dlpack_add_sub/model.py index e32e31c9a8..7f70e05d5c 100644 --- a/qa/python_models/dlpack_add_sub/model.py +++ b/qa/python_models/dlpack_add_sub/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,27 +24,27 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
-import triton_python_backend_utils as pb_utils -from torch.utils.dlpack import to_dlpack, from_dlpack -import torch -import numpy as np import json +import numpy as np +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack + class TritonPythonModel: - def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) self.numpy_to_pytorch_dtype = { np.bool_: torch.bool, np.uint8: torch.uint8, @@ -68,52 +68,63 @@ def execute(self, requests): # If both of the tensors are in CPU, use NumPy. if in_0.is_cpu() and in_1.is_cpu(): - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32)) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + + in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) + - in_1.as_numpy().astype(np.int32), + ) + out_tensor_0 = pb_utils.Tensor( + "OUTPUT0", out_0.astype(output0_dtype) + ) + out_tensor_1 = pb_utils.Tensor( + "OUTPUT1", out_1.astype(output1_dtype) + ) else: in_0_pytorch, in_1_pytorch = from_dlpack( - in_0.to_dlpack()), from_dlpack(in_1.to_dlpack()) - out_0, out_1 = (in_0_pytorch + in_1_pytorch, - in_0_pytorch - in_1_pytorch) + in_0.to_dlpack() + ), from_dlpack(in_1.to_dlpack()) + out_0, out_1 = ( + in_0_pytorch + in_1_pytorch, + in_0_pytorch - in_1_pytorch, + ) if self.output0_dtype == np.object_: out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", - out_0.numpy().astype(output0_dtype)) + "OUTPUT0", out_0.numpy().astype(output0_dtype) + ) else: - out_0 = out_0.type( - self.numpy_to_pytorch_dtype[output0_dtype]) + out_0 = out_0.type(self.numpy_to_pytorch_dtype[output0_dtype]) out_tensor_0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(out_0)) + "OUTPUT0", to_dlpack(out_0) + ) if self.output1_dtype == np.object_: out_tensor_1 = pb_utils.Tensor( - "OUTPUT1", - out_1.numpy().astype(output1_dtype)) + "OUTPUT1", out_1.numpy().astype(output1_dtype) + ) else: - out_1 = out_1.type( - self.numpy_to_pytorch_dtype[output1_dtype]) + out_1 = out_1.type(self.numpy_to_pytorch_dtype[output1_dtype]) out_tensor_1 = pb_utils.Tensor.from_dlpack( - "OUTPUT1", to_dlpack(out_1)) + "OUTPUT1", to_dlpack(out_1) + ) else: - in_0_pytorch, in_1_pytorch = from_dlpack( - in_0.to_dlpack()).cuda(), from_dlpack( - in_1.to_dlpack()).cuda() - out_0, out_1 = (in_0_pytorch + in_1_pytorch, - in_0_pytorch - in_1_pytorch) - out_tensor_0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(out_0)) - out_tensor_1 = pb_utils.Tensor.from_dlpack( - 
"OUTPUT1", to_dlpack(out_1)) + in_0_pytorch, in_1_pytorch = ( + from_dlpack(in_0.to_dlpack()).cuda(), + from_dlpack(in_1.to_dlpack()).cuda(), + ) + out_0, out_1 = ( + in_0_pytorch + in_1_pytorch, + in_0_pytorch - in_1_pytorch, + ) + out_tensor_0 = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(out_0)) + out_tensor_1 = pb_utils.Tensor.from_dlpack("OUTPUT1", to_dlpack(out_1)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/dlpack_empty_output/config.pbtxt b/qa/python_models/dlpack_empty_output/config.pbtxt new file mode 100644 index 0000000000..d026db1cd1 --- /dev/null +++ b/qa/python_models/dlpack_empty_output/config.pbtxt @@ -0,0 +1,43 @@ +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "dlpack_empty_output" +max_batch_size: 8 + +input [ + { + name: "INPUT" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] diff --git a/qa/python_models/dlpack_empty_output/model.py b/qa/python_models/dlpack_empty_output/model.py new file mode 100644 index 0000000000..7784e28b4d --- /dev/null +++ b/qa/python_models/dlpack_empty_output/model.py @@ -0,0 +1,53 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import to_dlpack + + +class TritonPythonModel: + def initialize(self, args): + pass + + def execute(self, requests): + responses = [] + + for _ in requests: + SHAPE = (0,) + + pytorch_tensor = torch.ones(SHAPE, dtype=torch.float32) + + device = torch.device("cuda:0") + pytorch_tensor = pytorch_tensor.to(device) + + dlpack_tensor = to_dlpack(pytorch_tensor) + pb_tensor = pb_utils.Tensor.from_dlpack("OUTPUT", dlpack_tensor) + + inference_response = pb_utils.InferenceResponse(output_tensors=[pb_tensor]) + responses.append(inference_response) + + return responses diff --git a/qa/python_models/dlpack_identity/model.py b/qa/python_models/dlpack_identity/model.py index 9057180381..1bd0748df9 100644 --- a/qa/python_models/dlpack_identity/model.py +++ b/qa/python_models/dlpack_identity/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """Identity model in Python backend that works with GPU and CPU tensors.""" @@ -36,7 +35,8 @@ def execute(self, requests): responses = [] for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") - out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", input_tensor.to_dlpack()) + out_tensor = pb_utils.Tensor.from_dlpack( + "OUTPUT0", input_tensor.to_dlpack() + ) responses.append(pb_utils.InferenceResponse([out_tensor])) return responses - diff --git a/qa/python_models/dlpack_io_identity/model.py b/qa/python_models/dlpack_io_identity/model.py index f98a4f51c4..225d026992 100644 --- a/qa/python_models/dlpack_io_identity/model.py +++ b/qa/python_models/dlpack_io_identity/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,9 +24,9 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
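The dlpack_empty_output model above verifies that a zero-element GPU tensor can be handed back to Triton through DLPack. The round trip it relies on can be sketched outside the model as follows; this is illustrative only, pb_utils exists only inside a running Python backend model, and the GPU move is kept optional here:

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils

empty = torch.ones((0,), dtype=torch.float32)
if torch.cuda.is_available():
    empty = empty.to("cuda:0")  # the QA model always places the tensor on GPU 0
pb_tensor = pb_utils.Tensor.from_dlpack("OUTPUT", to_dlpack(empty))
# A zero-element tensor survives the handoff, so no special-casing of the shape is needed.
assert tuple(from_dlpack(pb_tensor.to_dlpack()).shape) == (0,)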
-import triton_python_backend_utils as pb_utils -from torch.utils.dlpack import to_dlpack, from_dlpack import numpy as np +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack class TritonPythonModel: @@ -36,70 +36,73 @@ class TritonPythonModel: """ def initialize(self, args): - self._model_name = args['model_name'] + self._model_name = args["model_name"] def execute(self, requests): responses = [] for request in requests: input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") gpu_output = pb_utils.get_input_tensor_by_name( - request, "GPU_OUTPUT").as_numpy() + request, "GPU_OUTPUT" + ).as_numpy() if input0.is_cpu(): if not gpu_output[0]: - output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", input0.to_dlpack()) + output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", input0.to_dlpack()) else: outptu0_pytorch = from_dlpack(input0.to_dlpack()).cuda() output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(outptu0_pytorch)) + "OUTPUT0", to_dlpack(outptu0_pytorch) + ) else: if gpu_output[0]: - output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", input0.to_dlpack()) + output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", input0.to_dlpack()) else: outptu0_pytorch = from_dlpack(input0.to_dlpack()).cpu() output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(outptu0_pytorch)) + "OUTPUT0", to_dlpack(outptu0_pytorch) + ) next_gpu_output = pb_utils.Tensor("NEXT_GPU_OUTPUT", gpu_output[1:]) # Do not perform BLS inference if it is the first # model in the pipeline. - if self._model_name != 'dlpack_io_identity_1': + if self._model_name != "dlpack_io_identity_1": infer_request = pb_utils.InferenceRequest( - model_name='dlpack_io_identity_1', + model_name="dlpack_io_identity_1", inputs=[ input0, - pb_utils.get_input_tensor_by_name( - request, "GPU_OUTPUT") + pb_utils.get_input_tensor_by_name(request, "GPU_OUTPUT"), ], - requested_output_names=['OUTPUT0']) + requested_output_names=["OUTPUT0"], + ) infer_response = infer_request.exec() if infer_response.has_error(): raise pb_utils.TritonModelException( - infer_response.error().message()) + infer_response.error().message() + ) bls_output0 = pb_utils.get_output_tensor_by_name( - infer_response, 'OUTPUT0') + infer_response, "OUTPUT0" + ) if not output0.is_cpu(): - bls_output0 = from_dlpack( - bls_output0.to_dlpack()).detach().cpu().numpy() + bls_output0 = ( + from_dlpack(bls_output0.to_dlpack()).detach().cpu().numpy() + ) else: bls_output0 = bls_output0.as_numpy() if not input0.is_cpu(): - input0 = from_dlpack( - input0.to_dlpack()).detach().cpu().numpy() + input0 = from_dlpack(input0.to_dlpack()).detach().cpu().numpy() else: input0 = input0.as_numpy() if not np.allclose(bls_output0, input0): raise pb_utils.TritonModelException( - 'BLS input and output tensors are not equal') + "BLS input and output tensors are not equal" + ) - responses.append( - pb_utils.InferenceResponse([output0, next_gpu_output])) + responses.append(pb_utils.InferenceResponse([output0, next_gpu_output])) return responses diff --git a/qa/python_models/dlpack_io_identity_decoupled/model.py b/qa/python_models/dlpack_io_identity_decoupled/model.py index 5b4de86e60..5f4e597df8 100644 --- a/qa/python_models/dlpack_io_identity_decoupled/model.py +++ b/qa/python_models/dlpack_io_identity_decoupled/model.py @@ -1,4 +1,4 @@ -# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
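Besides the DLPack handling, the dlpack_io_identity model above also issues a nested BLS (Business Logic Scripting) request and validates the result. Stripped to its essentials, the call pattern is the one sketched below; the downstream model name is a placeholder rather than one of the QA models:

import triton_python_backend_utils as pb_utils


def bls_call(input_tensor):
    # Build and execute a nested request against another model on the same server.
    infer_request = pb_utils.InferenceRequest(
        model_name="some_downstream_model",  # placeholder
        inputs=[input_tensor],
        requested_output_names=["OUTPUT0"],
    )
    infer_response = infer_request.exec()
    if infer_response.has_error():
        raise pb_utils.TritonModelException(infer_response.error().message())
    return pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0")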
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,11 +24,11 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils -from torch.utils.dlpack import to_dlpack, from_dlpack -import numpy as np -import time import threading +import time + +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack class TritonPythonModel: @@ -38,7 +38,7 @@ class TritonPythonModel: """ def initialize(self, args): - self._model_name = args['model_name'] + self._model_name = args["model_name"] self.inflight_thread_count = 0 self.inflight_thread_count_lck = threading.Lock() @@ -48,20 +48,20 @@ def response_thread(self, response_sender, input0, gpu_output): if input0.is_cpu(): if not gpu_output[0]: - output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", - input0.to_dlpack()) + output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", input0.to_dlpack()) else: outptu0_pytorch = from_dlpack(input0.to_dlpack()).cuda() output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(outptu0_pytorch)) + "OUTPUT0", to_dlpack(outptu0_pytorch) + ) else: if gpu_output[0]: - output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", - input0.to_dlpack()) + output0 = pb_utils.Tensor.from_dlpack("OUTPUT0", input0.to_dlpack()) else: output0_pytorch = from_dlpack(input0.to_dlpack()).cpu() output0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(output0_pytorch)) + "OUTPUT0", to_dlpack(output0_pytorch) + ) next_gpu_output = pb_utils.Tensor("NEXT_GPU_OUTPUT", gpu_output[1:]) infer_response = pb_utils.InferenceResponse([output0, next_gpu_output]) @@ -71,8 +71,7 @@ def response_thread(self, response_sender, input0, gpu_output): for _ in range(response_repeat): response_sender.send(infer_response) - response_sender.send( - flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) with self.inflight_thread_count_lck: self.inflight_thread_count -= 1 @@ -81,11 +80,13 @@ def execute(self, requests): for request in requests: input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") gpu_output = pb_utils.get_input_tensor_by_name( - request, "GPU_OUTPUT").as_numpy() + request, "GPU_OUTPUT" + ).as_numpy() - thread = threading.Thread(target=self.response_thread, - args=(request.get_response_sender(), - input0, gpu_output)) + thread = threading.Thread( + target=self.response_thread, + args=(request.get_response_sender(), input0, gpu_output), + ) thread.daemon = True @@ -99,11 +100,11 @@ def finalize(self): cycles = 0 logging_time_sec = 5 sleep_time_sec = 0.1 - cycle_to_log = (logging_time_sec / sleep_time_sec) + cycle_to_log = logging_time_sec / sleep_time_sec while inflight_threads: with self.inflight_thread_count_lck: - inflight_threads = (self.inflight_thread_count != 0) - if (cycles % cycle_to_log == 0): + inflight_threads = self.inflight_thread_count != 0 + if cycles % cycle_to_log == 0: print( f"Waiting for {self.inflight_thread_count} response threads to complete..." ) diff --git a/qa/python_models/dlpack_square/config.pbtxt b/qa/python_models/dlpack_square/config.pbtxt new file mode 100644 index 0000000000..15cf6b7fd2 --- /dev/null +++ b/qa/python_models/dlpack_square/config.pbtxt @@ -0,0 +1,48 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "dlpack_square" +backend: "python" +max_batch_size: 0 +model_transaction_policy { + decoupled: True +} +input [ + { + name: "IN" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] +output [ + { + name: "OUT" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] +instance_group [{ kind: KIND_CPU }] + diff --git a/qa/python_models/dlpack_square/model.py b/qa/python_models/dlpack_square/model.py new file mode 100644 index 0000000000..b31531461e --- /dev/null +++ b/qa/python_models/dlpack_square/model.py @@ -0,0 +1,139 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
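The dlpack_square model whose implementation follows is decoupled: model_transaction_policy { decoupled: True } in the config above allows it to return any number of responses per request through a response sender instead of returning them from execute(). Before the threaded, DLPack-aware version below, here is the bare decoupled pattern as a sketch that uses only calls appearing elsewhere in this patch:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            in_numpy = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy()
            # Send IN back 'IN[0]' times, mirroring the variable response count of the model below.
            for _ in range(int(in_numpy[0])):
                out_tensor = pb_utils.Tensor("OUT", in_numpy)
                sender.send(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
            # Closing the sender tells Triton no more responses are coming.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None; all responses flow through the senders.
        return None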
+ +import json +import threading + +import numpy as np +import torch + +# triton_python_backend_utils is available in every Triton Python model. You +# need to use this module to create inference requests and responses. It also +# contains some utility functions for extracting information from model_config +# and converting Triton input/output types to numpy types. +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack + +numpy_to_pytorch_dtype = { + np.bool_: torch.bool, + np.uint8: torch.uint8, + np.int8: torch.int8, + np.int16: torch.int16, + np.int32: torch.int32, + np.int64: torch.int64, + np.float16: torch.float16, + np.float32: torch.float32, + np.float64: torch.float64, +} + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output_config = pb_utils.get_output_config_by_name(model_config, "OUT") + self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + + using_decoupled = pb_utils.using_decoupled_model_transaction_policy( + model_config + ) + if not using_decoupled: + raise pb_utils.TritonModelException( + """the model `{}` can generate any number of responses per request, + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) + + self.inflight_thread_count = 0 + self.inflight_thread_count_lck = threading.Lock() + + def execute(self, requests): + for request in requests: + self.process_request(request) + + return None + + def process_request(self, request): + # Start a separate thread to send the responses for the request. The + # sending back the responses is delegated to this thread. + thread = threading.Thread( + target=self.response_thread, + args=( + request.get_response_sender(), + pb_utils.get_input_tensor_by_name(request, "IN"), + self.output_dtype, + ), + ) + + thread.daemon = True + + with self.inflight_thread_count_lck: + self.inflight_thread_count += 1 + + thread.start() + + def response_thread(self, response_sender, in_input, output_dtype): + # The response_sender is used to send response(s) associated with the + # corresponding request. + + for idx in range(in_input.as_numpy()[0]): + if in_input.is_cpu(): + if ( + in_input.as_numpy().dtype.type is np.bytes_ + or in_input.as_numpy().dtype == np.object_ + ): + out_0 = in_input.as_numpy().astype(np.int32) + out_tensor = pb_utils.Tensor("OUT", out_0.astype(output_dtype)) + else: + in_0_pytorch = from_dlpack(in_input.to_dlpack()) + out_0 = in_0_pytorch + if output_dtype == np.object_: + out_tensor = pb_utils.Tensor( + "OUT", out_0.numpy().astype(output_dtype) + ) + else: + out_0 = out_0.type(numpy_to_pytorch_dtype[output_dtype]) + out_tensor = pb_utils.Tensor.from_dlpack( + "OUT", to_dlpack(out_0) + ) + else: + in_0_pytorch = from_dlpack(in_input.to_dlpack()).cuda() + out_0 = in_0_pytorch + out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(out_0)) + + response = pb_utils.InferenceResponse(output_tensors=[out_tensor]) + response_sender.send(response) + + # We must close the response sender to indicate to Triton that we are + # done sending responses for the corresponding request. We can't use the + # response sender after closing it. The response sender is closed by + # setting the TRITONSERVER_RESPONSE_COMPLETE_FINAL. 
+ response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + + with self.inflight_thread_count_lck: + self.inflight_thread_count -= 1 diff --git a/qa/python_models/dlpack_sub_add/model.py b/qa/python_models/dlpack_sub_add/model.py index af07874a9f..16caafcea2 100644 --- a/qa/python_models/dlpack_sub_add/model.py +++ b/qa/python_models/dlpack_sub_add/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,27 +24,27 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils -from torch.utils.dlpack import to_dlpack, from_dlpack -import torch -import numpy as np import json +import numpy as np +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack + class TritonPythonModel: - def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) self.numpy_to_pytorch_dtype = { np.bool_: torch.bool, np.uint8: torch.uint8, @@ -68,52 +68,63 @@ def execute(self, requests): # If both of the tensors are in CPU, use NumPy. 
if in_0.is_cpu() and in_1.is_cpu(): - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32)) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) + - in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) + + in_1.as_numpy().astype(np.int32), + ) + out_tensor_0 = pb_utils.Tensor( + "OUTPUT0", out_0.astype(output0_dtype) + ) + out_tensor_1 = pb_utils.Tensor( + "OUTPUT1", out_1.astype(output1_dtype) + ) else: in_0_pytorch, in_1_pytorch = from_dlpack( - in_0.to_dlpack()), from_dlpack(in_1.to_dlpack()) - out_0, out_1 = (in_0_pytorch - in_1_pytorch, - in_0_pytorch + in_1_pytorch) + in_0.to_dlpack() + ), from_dlpack(in_1.to_dlpack()) + out_0, out_1 = ( + in_0_pytorch - in_1_pytorch, + in_0_pytorch + in_1_pytorch, + ) if self.output0_dtype == np.object_: out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", - out_0.numpy().astype(output0_dtype)) + "OUTPUT0", out_0.numpy().astype(output0_dtype) + ) else: - out_0 = out_0.type( - self.numpy_to_pytorch_dtype[output0_dtype]) + out_0 = out_0.type(self.numpy_to_pytorch_dtype[output0_dtype]) out_tensor_0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(out_0)) + "OUTPUT0", to_dlpack(out_0) + ) if self.output1_dtype == np.object_: out_tensor_1 = pb_utils.Tensor( - "OUTPUT1", - out_1.numpy().astype(output1_dtype)) + "OUTPUT1", out_1.numpy().astype(output1_dtype) + ) else: - out_1 = out_1.type( - self.numpy_to_pytorch_dtype[output1_dtype]) + out_1 = out_1.type(self.numpy_to_pytorch_dtype[output1_dtype]) out_tensor_1 = pb_utils.Tensor.from_dlpack( - "OUTPUT1", to_dlpack(out_1)) + "OUTPUT1", to_dlpack(out_1) + ) else: - in_0_pytorch, in_1_pytorch = from_dlpack( - in_0.to_dlpack()).cuda(), from_dlpack( - in_1.to_dlpack()).cuda() - out_0, out_1 = (in_0_pytorch - in_1_pytorch, - in_0_pytorch + in_1_pytorch) - out_tensor_0 = pb_utils.Tensor.from_dlpack( - "OUTPUT0", to_dlpack(out_0)) - out_tensor_1 = pb_utils.Tensor.from_dlpack( - "OUTPUT1", to_dlpack(out_1)) + in_0_pytorch, in_1_pytorch = ( + from_dlpack(in_0.to_dlpack()).cuda(), + from_dlpack(in_1.to_dlpack()).cuda(), + ) + out_0, out_1 = ( + in_0_pytorch - in_1_pytorch, + in_0_pytorch + in_1_pytorch, + ) + out_tensor_0 = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(out_0)) + out_tensor_1 = pb_utils.Tensor.from_dlpack("OUTPUT1", to_dlpack(out_1)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/dlpack_test/model.py b/qa/python_models/dlpack_test/model.py index cd3ab37c7d..64bc7d6692 100644 --- a/qa/python_models/dlpack_test/model.py +++ b/qa/python_models/dlpack_test/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
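Both dlpack_add_sub and dlpack_sub_add (reformatted above) follow the same dispatch rule: if both inputs live on the CPU, compute with NumPy; otherwise hand the buffers to PyTorch through DLPack and compute on the GPU without copying through host memory. A condensed sketch of that skeleton, with the bytes/object dtype branch omitted for brevity:

import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import from_dlpack, to_dlpack


def add_and_sub(in_0, in_1):
    if in_0.is_cpu() and in_1.is_cpu():
        a, b = in_0.as_numpy(), in_1.as_numpy()
        return (
            pb_utils.Tensor("OUTPUT0", a + b),
            pb_utils.Tensor("OUTPUT1", a - b),
        )
    # At least one input is on the GPU: wrap the buffers as torch tensors via DLPack.
    a = from_dlpack(in_0.to_dlpack()).cuda()
    b = from_dlpack(in_1.to_dlpack()).cuda()
    return (
        pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(a + b)),
        pb_utils.Tensor.from_dlpack("OUTPUT1", to_dlpack(a - b)),
    )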
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,44 +24,61 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import unittest + +import cupy as cp +import numpy as np import torch -from torch.utils.dlpack import from_dlpack, to_dlpack import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import from_dlpack, to_dlpack class PBTensorTest(unittest.TestCase): - def test_pytorch_dlpack(self): # Test different dtypes pytorch_dtypes = [ - torch.float16, torch.float32, torch.float64, torch.int8, - torch.int16, torch.int32, torch.int64, torch.uint8, torch.bool + torch.float16, + torch.float32, + torch.float64, + torch.int8, + torch.int16, + torch.int32, + torch.int64, + torch.uint8, ] for pytorch_dtype in pytorch_dtypes: - pytorch_tensor = torch.rand([100], dtype=torch.float16) * 100 - pytorch_tensor = pytorch_tensor.type(pytorch_dtype) + pytorch_tensor = torch.ones([100], dtype=pytorch_dtype) dlpack_tensor = to_dlpack(pytorch_tensor) - pb_tensor = pb_utils.Tensor.from_dlpack('test_tensor', - dlpack_tensor) + pb_tensor = pb_utils.Tensor.from_dlpack("test_tensor", dlpack_tensor) self.assertTrue( - np.all(pb_tensor.as_numpy() == pytorch_tensor.numpy())) + np.array_equal(pb_tensor.as_numpy(), pytorch_tensor.numpy()) + ) # Convert the tensor back to DLPack and ensure that both tensors are # the same pytorch_tensor_dlpack = from_dlpack(pb_tensor.to_dlpack()) - self.assertTrue(torch.all(pytorch_tensor_dlpack == pytorch_tensor)) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, pytorch_tensor)) + + self.assertEqual(pytorch_tensor.type(), pytorch_tensor_dlpack.type()) + + # Now let's check that upgraded DLPack implementation also + # works as expected, i.e. from_dlpack should work with + # external pytorch tensor directly - # DLPack does not properly support bool type: - # https://github.com/google/jax/issues/4719 - if pytorch_dtype != torch.bool: - self.assertTrue( - pytorch_tensor.type() == pytorch_tensor_dlpack.type()) - else: - self.assertFalse( - pytorch_tensor.type() == pytorch_tensor_dlpack.type()) + pb_tensor_upgraded = pb_utils.Tensor.from_dlpack( + "test_tensor", pytorch_tensor + ) + self.assertTrue( + np.array_equal(pb_tensor_upgraded.as_numpy(), pytorch_tensor.numpy()) + ) + + # Here we check that `pb_tensor` as a producer, properly + # invokes `__dlpack__` and `__dlpack_device__` + pytorch_tensor_dlpack = from_dlpack(pb_tensor_upgraded) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, pytorch_tensor)) + + self.assertEqual(pytorch_tensor.type(), pytorch_tensor_dlpack.type()) def test_non_contiguous_error(self): pytorch_tensor = torch.rand([20, 30], dtype=torch.float16) @@ -70,78 +87,257 @@ def test_non_contiguous_error(self): pytorch_tensor = torch.transpose(pytorch_tensor, 0, 1) with self.assertRaises(Exception) as e: - pb_utils.Tensor.from_dlpack('test_tensor', - to_dlpack(pytorch_tensor)) + pb_utils.Tensor.from_dlpack("test_tensor", to_dlpack(pytorch_tensor)) self.assertTrue( - str(e.exception) == - 'DLPack tensor is not contiguous. Only contiguous DLPack tensors that are stored in C-Order are supported.' + str(e.exception) + == "DLPack tensor is not contiguous. Only contiguous DLPack tensors that are stored in C-Order are supported." 
) def test_dlpack_string_tensor(self): - np_object = np.array(['An Example String'], dtype=np.object_) - pb_tensor = pb_utils.Tensor('test_tensor', np_object) + np_object = np.array(["An Example String"], dtype=np.object_) + pb_tensor = pb_utils.Tensor("test_tensor", np_object) with self.assertRaises(Exception) as e: pb_tensor.to_dlpack() self.assertTrue( - str(e.exception) == - 'DLPack does not have support for string tensors.') + str(e.exception) == "DLPack does not have support for string tensors." + ) def test_dlpack_gpu_tensors(self): # Test different dtypes + # PyTorch does not support DLPack bool type yet: + # https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/DLConvertor.cpp pytorch_dtypes = [ - torch.float16, torch.float32, torch.float64, torch.int8, - torch.int16, torch.int32, torch.int64, torch.uint8, torch.bool + torch.float16, + torch.float32, + torch.float64, + torch.int8, + torch.int16, + torch.int32, + torch.int64, + torch.uint8, ] for pytorch_dtype in pytorch_dtypes: - pytorch_tensor = torch.rand( - [100], dtype=torch.float16, device='cuda') * 100 - pytorch_tensor = pytorch_tensor.type(pytorch_dtype) + pytorch_tensor = torch.ones([100], dtype=pytorch_dtype, device="cuda") dlpack_tensor = to_dlpack(pytorch_tensor) - pb_tensor = pb_utils.Tensor.from_dlpack('test_tensor', - dlpack_tensor) + pb_tensor = pb_utils.Tensor.from_dlpack("test_tensor", dlpack_tensor) # Convert the tensor back to DLPack and ensure that both tensors are # the same pytorch_tensor_dlpack = from_dlpack(pb_tensor.to_dlpack()) - self.assertTrue(torch.all(pytorch_tensor_dlpack == pytorch_tensor)) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, pytorch_tensor)) + self.assertEqual(pytorch_tensor.type(), pytorch_tensor_dlpack.type()) - # DLPack does not properly support bool type: - # https://github.com/google/jax/issues/4719 - if pytorch_dtype != torch.bool: - self.assertTrue( - pytorch_tensor.type() == pytorch_tensor_dlpack.type()) - else: - self.assertFalse( - pytorch_tensor.type() == pytorch_tensor_dlpack.type()) + # Now we make sure that updated DLPack implementation works + # with GPU as well + pb_tensor = pb_utils.Tensor.from_dlpack("test_tensor", pytorch_tensor) + pytorch_tensor_dlpack = from_dlpack(pb_tensor) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, pytorch_tensor)) + self.assertEqual(pytorch_tensor.type(), pytorch_tensor_dlpack.type()) def test_dlpack_gpu_numpy(self): # DLPack tesnors that are in GPU cannot be converted to NumPy - pytorch_tensor = torch.rand([100], dtype=torch.float16, - device='cuda') * 100 - pb_tensor = pb_utils.Tensor.from_dlpack('tensor', - to_dlpack(pytorch_tensor)) + pytorch_tensor = torch.rand([100], dtype=torch.float16, device="cuda") * 100 + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", to_dlpack(pytorch_tensor)) + # Make sure that `__dlpack_device__` works as expected + self.assertFalse(pb_tensor.is_cpu()) + self.assertTrue(pytorch_tensor.is_cuda) + self.assertEqual( + pb_tensor.__dlpack_device__(), pytorch_tensor.__dlpack_device__() + ) + with self.assertRaises(Exception) as e: pb_tensor.as_numpy() self.assertTrue( - str(e.exception) == - 'Tensor is stored in GPU and cannot be converted to NumPy.') + str(e.exception) + == "Tensor is stored in GPU and cannot be converted to NumPy." 
+ ) + + def test_dlpack_cpu_numpy(self): + # Check compatibiity of PbTensor DLPack implementation + # with numpy + pytorch_tensor = torch.rand([100], dtype=torch.float16, device="cpu") * 100 + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", pytorch_tensor) + numpy_tensor_dlpack = np.from_dlpack(pb_tensor) + self.assertTrue(np.array_equal(numpy_tensor_dlpack, pytorch_tensor.numpy())) + # Make sure that `__dlpack_device__` works as expected + self.assertTrue(pb_tensor.is_cpu()) + self.assertFalse(pytorch_tensor.is_cuda) + self.assertEqual( + pb_tensor.__dlpack_device__(), pytorch_tensor.__dlpack_device__() + ) + def test_bool_datatype(self): + # [FIXME] pass bool_array directly to `pb_utils.Tensor.from_dlpack`, + # when numpy release supports DLPack bool type + bool_array = np.asarray([False, True]) + bool_tensor = pb_utils.Tensor("tensor", bool_array) + bool_tensor_dlpack = pb_utils.Tensor.from_dlpack("tensor", bool_tensor) + self.assertTrue(np.array_equal(bool_array, bool_tensor_dlpack.as_numpy())) -class TritonPythonModel: + def test_cuda_multi_stream(self): + # Test that external stream syncs with the default + # and pb_tensor has proper data + size = 5000 + pytorch_tensor_1 = torch.tensor([0, 0, 0, 0], device="cuda") + pytorch_tensor_2 = torch.tensor([0, 0, 0, 0], device="cuda") + expected_output = torch.tensor([2, 2, 2, 2], device="cuda") + s1 = torch.cuda.Stream() + with torch.cuda.stream(s1): + matrix_a = torch.randn(size, size, device="cuda") + res = torch.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = torch.matmul(res, matrix_a) + pytorch_tensor_1 += torch.tensor([2, 2, 2, 2], device="cuda") + pytorch_tensor_2 += torch.tensor([2, 2, 2, 2], device="cuda") + + pb_tensor_1 = pb_utils.Tensor.from_dlpack("tensor", pytorch_tensor_1) + pb_tensor_2 = pb_utils.Tensor.from_dlpack("tensor", to_dlpack(pytorch_tensor_2)) + pytorch_tensor_dlpack = from_dlpack(pb_tensor_1) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, expected_output)) + pytorch_tensor_dlpack = from_dlpack(pb_tensor_2) + self.assertTrue(torch.equal(pytorch_tensor_dlpack, expected_output)) + + def test_cuda_non_blocking_multi_stream(self): + # Test that external non-blocking stream syncs with the default stream + # and pb_tensor has proper data + size = 5000 + cupy_tensor = cp.array([0, 0, 0, 0]) + expected_output = cp.array([2, 2, 2, 2]) + non_blocking_stream = cp.cuda.Stream(non_blocking=True) + with non_blocking_stream: + matrix_a = cp.random.rand(size, size) + res = cp.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = cp.matmul(res, matrix_a) + cupy_tensor += cp.array([2, 2, 2, 2]) + + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", cupy_tensor) + # Verify that non-blocking stream has no pending jobs left + self.assertTrue(non_blocking_stream.done) + cupy_tensor_dlpack = cp.from_dlpack(pb_tensor) + self.assertTrue(cp.array_equal(cupy_tensor_dlpack, expected_output)) + self.assertFalse(pb_tensor.is_cpu()) + self.assertEqual(pb_tensor.__dlpack_device__(), cupy_tensor.__dlpack_device__()) + + def test_cuda_multi_gpu(self): + # Test that when `pb_utils.Tensor.from_dlpack` is called on different + # GPU from where external tensor is stored, we receive a pointer + # and all pending work on different GPU's default stream + # on external tensor is done + size = 5000 + # DLDeviceType::kDLCUDA, device_id 1 + expected_dlpack_device = (2, 1) + with cp.cuda.Device(1): + expected_output = cp.array([2, 2, 2, 2]) + cupy_tensor = cp.array([0, 0, 0, 0]) + matrix_a = cp.random.rand(size, size) + res = 
cp.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = cp.matmul(res, matrix_a) + cupy_tensor += cp.array([2, 2, 2, 2]) + with cp.cuda.Device(0): + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", cupy_tensor) + with cp.cuda.Device(1): + # To make sure that the default stream is done with + # all compute work + self.assertTrue(cp.cuda.Stream(null=True).done) + cupy_tensor_dlpack = cp.from_dlpack(pb_tensor) + + with cp.cuda.Device(1): + self.assertTrue(cp.array_equal(cupy_tensor_dlpack, expected_output)) + + self.assertFalse(pb_tensor.is_cpu()) + self.assertEqual(pb_tensor.__dlpack_device__(), expected_dlpack_device) + self.assertEqual(pb_tensor.__dlpack_device__(), cupy_tensor.__dlpack_device__()) + def test_cuda_blocking_stream_multi_gpu(self): + # Test that when `pb_utils.Tensor.from_dlpack` is called on different + # GPU from where external tensor is stored, we receive a pointer + # and all pending work on different GPU's a blocking stream + # on external tensor is done + size = 5000 + # DLDeviceType::kDLCUDA, device_id 1 + expected_dlpack_device = (2, 1) + with cp.cuda.Device(1): + expected_output = cp.array([2, 2, 2, 2]) + blocking_stream = cp.cuda.Stream(non_blocking=False) + with blocking_stream: + cupy_tensor = cp.array([0, 0, 0, 0]) + matrix_a = cp.random.rand(size, size) + res = cp.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = cp.matmul(res, matrix_a) + cupy_tensor += cp.array([2, 2, 2, 2]) + with cp.cuda.Device(0): + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", cupy_tensor) + with cp.cuda.Device(1): + # To make sure that blocking stream is done with + # all compute work + self.assertTrue(blocking_stream.done) + cupy_tensor_dlpack = cp.from_dlpack(pb_tensor) + + with cp.cuda.Device(1): + self.assertTrue(cp.array_equal(cupy_tensor_dlpack, expected_output)) + + self.assertFalse(pb_tensor.is_cpu()) + self.assertEqual(pb_tensor.__dlpack_device__(), expected_dlpack_device) + self.assertEqual(pb_tensor.__dlpack_device__(), cupy_tensor.__dlpack_device__()) + + def test_cuda_non_blocking_stream_multi_gpu(self): + # Test that when `pb_utils.Tensor.from_dlpack` is called on different + # GPU from where external tensor is stored, we receive a pointer + # and all pending work on different GPU's non-blocking stream + # on external tensor is done. + # This test seems to be affected by `test_cuda_multi_gpu` + # and `test_cuda_blocking_stream_multi_gpu` if GPUs 0 and 1 are used. 
+ # Thus for this test, we use GPUs 0 and 2 + # JIRA: DLIS-4887 + size = 5000 + # DLDeviceType::kDLCUDA, device_id 1 + expected_dlpack_device = (2, 2) + with cp.cuda.Device(2): + expected_output = cp.array([2, 2, 2, 2]) + non_blocking_stream = cp.cuda.Stream(non_blocking=True) + with non_blocking_stream: + cupy_tensor = cp.array([0, 0, 0, 0]) + matrix_a = cp.random.rand(size, size) + res = cp.matmul(matrix_a, matrix_a) + for _ in range(1000): + res = cp.matmul(res, matrix_a) + cupy_tensor += cp.array([2, 2, 2, 2]) + with cp.cuda.Device(0): + pb_tensor = pb_utils.Tensor.from_dlpack("tensor", cupy_tensor) + with cp.cuda.Device(2): + # To make sure that non_blocking stream is done with + # all compute work + self.assertTrue(non_blocking_stream.done) + cupy_tensor_dlpack = cp.from_dlpack(pb_tensor) + + with cp.cuda.Device(2): + self.assertTrue(cp.array_equal(cupy_tensor_dlpack, expected_output)) + + self.assertFalse(pb_tensor.is_cpu()) + self.assertEqual(pb_tensor.__dlpack_device__(), expected_dlpack_device) + self.assertEqual(pb_tensor.__dlpack_device__(), cupy_tensor.__dlpack_device__()) + + +class TritonPythonModel: def execute(self, requests): responses = [] for _ in requests: # Run the unittest and store the results in InferenceResponse. - test = unittest.main('model', exit=False) + test = unittest.main("model", exit=False) responses.append( - pb_utils.InferenceResponse([ - pb_utils.Tensor( - 'OUTPUT0', - np.array([test.result.wasSuccessful()], - dtype=np.float16)) - ])) + pb_utils.InferenceResponse( + [ + pb_utils.Tensor( + "OUTPUT0", + np.array([test.result.wasSuccessful()], dtype=np.float16), + ) + ] + ) + ) return responses diff --git a/qa/python_models/error_code/config.pbtxt b/qa/python_models/error_code/config.pbtxt new file mode 100644 index 0000000000..90fd5eb1e3 --- /dev/null +++ b/qa/python_models/error_code/config.pbtxt @@ -0,0 +1,47 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
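The stream and multi-GPU tests above repeatedly assert on __dlpack_device__(), which returns a (device_type, device_id) pair; in the DLPack enum kDLCPU is 1 and kDLCUDA is 2, which is why a tensor on GPU 1 is expected to report (2, 1). A tiny illustration, assuming a recent PyTorch and at least two visible GPUs:

import torch

cpu_tensor = torch.zeros(4)                   # expected (1, 0): kDLCPU, device 0
gpu_tensor = torch.zeros(4, device="cuda:1")  # expected (2, 1): kDLCUDA, device 1
print(cpu_tensor.__dlpack_device__(), gpu_tensor.__dlpack_device__())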
+ +name: "error_code" +backend: "python" +max_batch_size: 4 + +input [ + { + name: "ERROR_CODE" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +output [ + { + name: "DUMMY_OUT" + data_type: TYPE_STRING + dims: [ 1 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/error_code/model.py b/qa/python_models/error_code/model.py new file mode 100644 index 0000000000..078a4afb73 --- /dev/null +++ b/qa/python_models/error_code/model.py @@ -0,0 +1,59 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + error_code_map = { + "UNKNOWN": pb_utils.TritonError.UNKNOWN, + "INTERNAL": pb_utils.TritonError.INTERNAL, + "NOT_FOUND": pb_utils.TritonError.NOT_FOUND, + "INVALID_ARG": pb_utils.TritonError.INVALID_ARG, + "UNAVAILABLE": pb_utils.TritonError.UNAVAILABLE, + "UNSUPPORTED": pb_utils.TritonError.UNSUPPORTED, + "ALREADY_EXISTS": pb_utils.TritonError.ALREADY_EXISTS, + "CANCELLED": pb_utils.TritonError.CANCELLED, + } + + responses = [] + + for request in requests: + err_code_tensor = pb_utils.get_input_tensor_by_name( + request, "ERROR_CODE" + ).as_numpy() + err_code_str = str(err_code_tensor[0][0], encoding="utf-8") + if err_code_str in error_code_map: + error = pb_utils.TritonError( + message=("error code: " + err_code_str), + code=error_code_map[err_code_str], + ) + else: + error = pb_utils.TritonError("unrecognized error code: " + err_code_str) + responses.append(pb_utils.InferenceResponse(error=error)) + + return responses diff --git a/qa/python_models/execute_cancel/config.pbtxt b/qa/python_models/execute_cancel/config.pbtxt new file mode 100644 index 0000000000..df509863ad --- /dev/null +++ b/qa/python_models/execute_cancel/config.pbtxt @@ -0,0 +1,47 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "execute_cancel" +backend: "python" +max_batch_size: 1 + +input [ + { + name: "EXECUTE_DELAY" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "DUMMY_OUT" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/execute_cancel/model.py b/qa/python_models/execute_cancel/model.py new file mode 100644 index 0000000000..ec7b96ec1a --- /dev/null +++ b/qa/python_models/execute_cancel/model.py @@ -0,0 +1,108 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
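For the error_code model added above, a hypothetical client-side sketch of how a test could request a specific error code; the server URL, model version layout, and the exact exception text are assumptions, and TYPE_STRING inputs are sent as BYTES tensors with a numpy object array:

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

client = httpclient.InferenceServerClient(url="localhost:8000")
# max_batch_size is 4 and dims are [1], so the request shape is [batch, 1].
inp = httpclient.InferInput("ERROR_CODE", [1, 1], "BYTES")
inp.set_data_from_numpy(np.array([["INVALID_ARG"]], dtype=np.object_))
try:
    client.infer("error_code", inputs=[inp])
except InferenceServerException as e:
    print(e)  # expected to mention "error code: INVALID_ARG"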
+ +import json +import threading +import time + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self._logger = pb_utils.Logger + self._model_config = json.loads(args["model_config"]) + self._using_decoupled = pb_utils.using_decoupled_model_transaction_policy( + self._model_config + ) + + def execute(self, requests): + processed_requests = [] + for request in requests: + delay_tensor = pb_utils.get_input_tensor_by_name( + request, "EXECUTE_DELAY" + ).as_numpy() + delay = delay_tensor[0][0] # seconds + if self._using_decoupled: + processed_requests.append( + {"response_sender": request.get_response_sender(), "delay": delay} + ) + else: + processed_requests.append({"request": request, "delay": delay}) + if self._using_decoupled: + return self._execute_decoupled(processed_requests) + return self._execute_processed_requests(processed_requests) + + def _execute_processed_requests(self, processed_requests): + responses = [] + for processed_request in processed_requests: + error = pb_utils.TritonError(message="not cancelled") + object_to_check_cancelled = None + if "response_sender" in processed_request: + object_to_check_cancelled = processed_request["response_sender"] + elif "request" in processed_request: + object_to_check_cancelled = processed_request["request"] + delay = processed_request["delay"] # seconds + time_elapsed = 0.0 # seconds + while time_elapsed < delay: + time.sleep(1) + time_elapsed += 1.0 + if object_to_check_cancelled.is_cancelled(): + self._logger.log_info( + "[execute_cancel] Request cancelled at " + + str(time_elapsed) + + " s" + ) + error = pb_utils.TritonError( + message="cancelled", code=pb_utils.TritonError.CANCELLED + ) + break + self._logger.log_info( + "[execute_cancel] Request not cancelled at " + + str(time_elapsed) + + " s" + ) + responses.append(pb_utils.InferenceResponse(error=error)) + return responses + + def _execute_decoupled(self, processed_requests): + def response_thread(execute_processed_requests, processed_requests): + time.sleep(2) # execute after requests are released + responses = execute_processed_requests(processed_requests) + for i in range(len(responses)): # len(responses) == len(processed_requests) + response_sender = processed_requests[i]["response_sender"] + response_sender.send(responses[i]) + response_sender.send( + flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + + thread = threading.Thread( + target=response_thread, + args=(self._execute_processed_requests, processed_requests), + ) + thread.daemon = True + thread.start() + return None diff --git a/qa/python_models/execute_error/model.py b/qa/python_models/execute_error/model.py index 2a244e083e..9ecdbff816 100644 --- a/qa/python_models/execute_error/model.py +++ b/qa/python_models/execute_error/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,24 +28,23 @@ class TritonPythonModel: - def execute(self, requests): - """ This function is called on inference request. 
- """ + """This function is called on inference request.""" responses = [] - # Only generate the error for the first request + # Generate the error for the first and third request i = 0 for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "IN") out_tensor = pb_utils.Tensor("OUT", input_tensor.as_numpy()) if i == 0: - error = pb_utils.TritonError( - 'An error occured during execution') - responses.append(pb_utils.InferenceResponse([out_tensor], - error)) - else: + error = pb_utils.TritonError("An error occurred during execution") + responses.append(pb_utils.InferenceResponse([out_tensor], error)) + elif i == 1: responses.append(pb_utils.InferenceResponse([out_tensor])) + elif i == 2: + error = pb_utils.TritonError("An error occurred during execution") + responses.append(pb_utils.InferenceResponse(error=error)) i += 1 return responses diff --git a/qa/python_models/execute_return_error/model.py b/qa/python_models/execute_return_error/model.py index 6e19d68e4a..e304441f04 100644 --- a/qa/python_models/execute_return_error/model.py +++ b/qa/python_models/execute_return_error/model.py @@ -1,4 +1,4 @@ -# Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,11 +24,8 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils - class TritonPythonModel: - def initialize(self, args): self._i = -1 diff --git a/qa/python_models/fini_error/model.py b/qa/python_models/fini_error/model.py index 3f8c1ab5f3..7a9f409aee 100644 --- a/qa/python_models/fini_error/model.py +++ b/qa/python_models/fini_error/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """ The body of this model doesn't matter. The main purpose of this model is diff --git a/qa/python_models/ground_truth/config.pbtxt b/qa/python_models/ground_truth/config.pbtxt new file mode 100644 index 0000000000..2b7a7d19a2 --- /dev/null +++ b/qa/python_models/ground_truth/config.pbtxt @@ -0,0 +1,52 @@ +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "ground_truth" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] diff --git a/qa/python_models/ground_truth/model.py b/qa/python_models/ground_truth/model.py new file mode 100644 index 0000000000..24a286e300 --- /dev/null +++ b/qa/python_models/ground_truth/model.py @@ -0,0 +1,51 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
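The execute_error changes above now exercise two ways a response can carry an error: alongside output tensors (first request) and as an error-only response with no tensors (third request). A minimal sketch of both forms, assuming the backend context where pb_utils is importable:

import numpy as np
import triton_python_backend_utils as pb_utils  # importable only inside the Python backend

error = pb_utils.TritonError("An error occurred during execution")
out_tensor = pb_utils.Tensor("OUT", np.zeros(1, dtype=np.float32))

with_output = pb_utils.InferenceResponse([out_tensor], error)  # tensors plus an error
error_only = pb_utils.InferenceResponse(error=error)           # error with no output tensors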
+ +import time + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + """ + Mock Model that uses the input data to determine how long to wait + before returning identity data + """ + assert len(requests) == 1 + delay = 0 + request = requests[0] + responses = [] + + delay_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + delay_as_numpy = delay_tensor.as_numpy() + delay = float(delay_as_numpy[0][0]) + + out_tensor = pb_utils.Tensor("OUTPUT0", delay_as_numpy) + responses.append(pb_utils.InferenceResponse([out_tensor])) + + time.sleep(delay) + return responses diff --git a/qa/python_models/identity_fp32/model.py b/qa/python_models/identity_fp32/model.py index 4273977263..2161a1e732 100644 --- a/qa/python_models/identity_fp32/model.py +++ b/qa/python_models/identity_fp32/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """ Identity model in Python backend. diff --git a/qa/python_models/identity_fp32_logging/config.pbtxt b/qa/python_models/identity_fp32_logging/config.pbtxt new file mode 100644 index 0000000000..aaa4a2ee43 --- /dev/null +++ b/qa/python_models/identity_fp32_logging/config.pbtxt @@ -0,0 +1,53 @@ +# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
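The ground_truth model above reads its sleep duration from INPUT0, echoes it back as OUTPUT0, and only then sleeps, so the measured latency includes the delay. Because max_batch_size is 64, the input arrives shaped [batch, 1], which is why the model indexes [0][0]. A small illustration of the expected input layout:

import numpy as np

delay_input = np.array([[0.25]], dtype=np.float32)  # shape (1, 1): [batch, dims]
delay = float(delay_input[0][0])                    # 0.25 seconds, as indexed in the model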
+ +name: "identity_fp32_logging" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] + diff --git a/qa/python_models/identity_fp32_logging/model.py b/qa/python_models/identity_fp32_logging/model.py new file mode 100644 index 0000000000..91ace61fd5 --- /dev/null +++ b/qa/python_models/identity_fp32_logging/model.py @@ -0,0 +1,72 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + logger = pb_utils.Logger + logger.log("Initialize-Specific Msg!", logger.INFO) + logger.log_info("Initialize-Info Msg!") + logger.log_warn("Initialize-Warning Msg!") + logger.log_error("Initialize-Error Msg!") + logger.log_verbose("Initialize-Verbose Msg!") + + def execute(self, requests): + """ + Identity model in Python backend. 
+ """ + # Log as early as possible + logger = pb_utils.Logger + logger.log("Execute-Specific Msg!", logger.INFO) + logger.log_info("Execute-Info Msg!") + logger.log_warn("Execute-Warning Msg!") + logger.log_error("Execute-Error Msg!") + logger.log_verbose("Execute-Verbose Msg!") + + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + out_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy()) + responses.append(pb_utils.InferenceResponse([out_tensor])) + + # Log as late as possible + logger.log("Execute-Specific Msg!", logger.INFO) + logger.log_info("Execute-Info Msg!") + logger.log_warn("Execute-Warning Msg!") + logger.log_error("Execute-Error Msg!") + logger.log_verbose("Execute-Verbose Msg!") + + return responses + + def finalize(self): + logger = pb_utils.Logger + logger.log("Finalize-Specific Msg!", logger.INFO) + logger.log_info("Finalize-Info Msg!") + logger.log_warn("Finalize-Warning Msg!") + logger.log_error("Finalize-Error Msg!") + logger.log_verbose("Finalize-Verbose Msg!") diff --git a/qa/python_models/identity_fp32_timeout/config.pbtxt b/qa/python_models/identity_fp32_timeout/config.pbtxt new file mode 100644 index 0000000000..c14fd8e0a3 --- /dev/null +++ b/qa/python_models/identity_fp32_timeout/config.pbtxt @@ -0,0 +1,60 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +name: "identity_fp32_timeout" +backend: "python" +max_batch_size: 64 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind : KIND_CPU + } +] + +dynamic_batching { + default_queue_policy { + timeout_action: REJECT + allow_timeout_override: true + default_timeout_microseconds: 1000000 + } +} diff --git a/qa/python_models/identity_fp32_timeout/model.py b/qa/python_models/identity_fp32_timeout/model.py new file mode 100644 index 0000000000..356948e8de --- /dev/null +++ b/qa/python_models/identity_fp32_timeout/model.py @@ -0,0 +1,45 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import time + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def execute(self, requests): + """ + Identity model in Python backend. + """ + logger = pb_utils.Logger + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + out_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy()) + logger.log_info(f"Request timeout: {request.timeout()}") + time.sleep(5) + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/qa/python_models/init_args/model.py b/qa/python_models/init_args/model.py index 2f3d933b79..12dd2212a1 100644 --- a/qa/python_models/init_args/model.py +++ b/qa/python_models/init_args/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,20 +24,39 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+import os + import numpy as np import triton_python_backend_utils as pb_utils -class TritonPythonModel: +def check_init_args(args): + expected_args = { + "model_name": "init_args", + "model_instance_name": "init_args_0_0", + "model_instance_kind": "CPU", + "model_instance_device_id": "0", + "model_repository": os.getenv("TRITON_DIR", "/opt/tritonserver") + + "/qa/L0_backend_python/models/init_args", + "model_version": "1", + } - def initialize(self, args): - self.args = args - if args['model_name'] != 'init_args' or args[ - 'model_instance_name'] != 'init_args_0': + for arg in expected_args: + if args[arg] != expected_args[arg]: raise pb_utils.TritonModelException( - 'model_instance_name/model_name does not contain correct value.' + arg + + ' does not contain correct value. Expected "' + + expected_args[arg] + + ", got " + + args[arg] ) + +class TritonPythonModel: + def initialize(self, args): + self.args = args + check_init_args(self.args) + def execute(self, requests): """ This function counts the number of keys in the @@ -45,9 +64,13 @@ def execute(self, requests): correct. """ keys = [ - 'model_config', 'model_instance_kind', 'model_instance_name', - 'model_instance_device_id', 'model_repository', 'model_version', - 'model_name' + "model_config", + "model_instance_kind", + "model_instance_name", + "model_instance_device_id", + "model_repository", + "model_version", + "model_name", ] correct_keys = 0 @@ -58,6 +81,7 @@ def execute(self, requests): responses = [] for _ in requests: out_args = pb_utils.Tensor( - "OUT", np.array([correct_keys], dtype=np.float32)) + "OUT", np.array([correct_keys], dtype=np.float32) + ) responses.append(pb_utils.InferenceResponse([out_args])) return responses diff --git a/qa/python_models/init_error/model.py b/qa/python_models/init_error/model.py index 11c6a6fb07..654dc8ef2c 100644 --- a/qa/python_models/init_error/model.py +++ b/qa/python_models/init_error/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,9 +28,8 @@ class TritonPythonModel: - def initialize(self, args): - self.model_config = args['model_config'] + self.model_config = args["model_config"] lorem_ipsum def execute(self, requests): diff --git a/qa/python_models/init_exit/config.pbtxt b/qa/python_models/init_exit/config.pbtxt new file mode 100644 index 0000000000..a18aff189d --- /dev/null +++ b/qa/python_models/init_exit/config.pbtxt @@ -0,0 +1,46 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "init_exit" +backend: "python" + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/init_exit/model.py b/qa/python_models/init_exit/model.py new file mode 100644 index 0000000000..e0fc8b55a4 --- /dev/null +++ b/qa/python_models/init_exit/model.py @@ -0,0 +1,40 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import os +import signal +import time + + +class TritonPythonModel: + def initialize(self, args): + time.sleep(3) + # Simulate the case that the model goes out of memory and gets killed + # by the OOM killer + os.kill(os.getpid(), signal.SIGKILL) + + def execute(self, requests): + pass diff --git a/qa/python_models/iterative_sequence/config.pbtxt b/qa/python_models/iterative_sequence/config.pbtxt new file mode 100644 index 0000000000..faa1735718 --- /dev/null +++ b/qa/python_models/iterative_sequence/config.pbtxt @@ -0,0 +1,51 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "iterative_sequence" +backend: "python" +max_batch_size: 0 +model_transaction_policy { + decoupled: True +} +input [ + { + name: "IN" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] +output [ + { + name: "OUT" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] +sequence_batching { + iterative_sequence : true +} + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/iterative_sequence/model.py b/qa/python_models/iterative_sequence/model.py new file mode 100644 index 0000000000..c45f82a607 --- /dev/null +++ b/qa/python_models/iterative_sequence/model.py @@ -0,0 +1,131 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + """ + This model takes 1 input tensor, an INT32 [ 1 ] input named "IN", and + produces an output tensor "OUT" with the same shape as the input tensor. + The input value indicates the total number of responses to be generated and + the output value indicates the number of remaining responses. For example, + if the request input has value 2, the model will: + - Send a response with value 1. + - Release request with RESCHEDULE flag. + - When execute on the same request, send the last response with value 0. + - Release request with ALL flag. + """ + + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + using_decoupled = pb_utils.using_decoupled_model_transaction_policy( + model_config + ) + if not using_decoupled: + raise pb_utils.TritonModelException( + """the model `{}` can generate any number of responses per request, + enable decoupled transaction policy in model configuration to + serve this model""".format( + args["model_name"] + ) + ) + + # Get IN configuration + in_config = pb_utils.get_input_config_by_name(model_config, "IN") + + # Validate the shape and data type of IN + in_shape = in_config["dims"] + if (len(in_shape) != 1) or (in_shape[0] != 1): + raise pb_utils.TritonModelException( + """the model `{}` requires the shape of 'IN' to be + [1], got {}""".format( + args["model_name"], in_shape + ) + ) + if in_config["data_type"] != "TYPE_INT32": + raise pb_utils.TritonModelException( + """the model `{}` requires the data_type of 'IN' to be + 'TYPE_INT32', got {}""".format( + args["model_name"], in_config["data_type"] + ) + ) + + # Get OUT configuration + out_config = pb_utils.get_output_config_by_name(model_config, "OUT") + + # Validate the shape and data type of OUT + out_shape = out_config["dims"] + if (len(out_shape) != 1) or (out_shape[0] != 1): + raise pb_utils.TritonModelException( + """the model `{}` requires the shape of 'OUT' to be + [1], got {}""".format( + args["model_name"], out_shape + ) + ) + if out_config["data_type"] != "TYPE_INT32": + raise pb_utils.TritonModelException( + """the model `{}` requires the data_type of 'OUT' to be + 'TYPE_INT32', got {}""".format( + args["model_name"], out_config["data_type"] + ) + ) + + self.remaining_response = 0 + self.reset_flag = True + + def execute(self, requests): + for request in requests: + in_input = pb_utils.get_input_tensor_by_name(request, "IN").as_numpy() + + if self.reset_flag: + self.remaining_response = in_input[0] + self.reset_flag = False + + response_sender = request.get_response_sender() + + self.remaining_response -= 1 + + out_output = pb_utils.Tensor( + "OUT", np.array([self.remaining_response], np.int32) + ) + response = pb_utils.InferenceResponse(output_tensors=[out_output]) + + if self.remaining_response <= 0: + response_sender.send( + response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL + ) + else: + 
request.set_release_flags( + pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE + ) + response_sender.send(response) + + return None diff --git a/qa/python_models/model_env/model.py b/qa/python_models/model_env/model.py index 0eff470394..8cc9db8d81 100644 --- a/qa/python_models/model_env/model.py +++ b/qa/python_models/model_env/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,17 +25,18 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import os + import triton_python_backend_utils as pb_utils class TritonPythonModel: - def initialize(self, args): # Make sure that environment variables are correctly propagated # to the Python models - if "MY_ENV" not in os.environ or os.environ["MY_ENV"] != 'MY_ENV': + if "MY_ENV" not in os.environ or os.environ["MY_ENV"] != "MY_ENV": raise pb_utils.TritonModelException( - "MY_ENV doesn't exists or contains incorrect value") + "MY_ENV doesn't exists or contains incorrect value" + ) def execute(self, requests): pass diff --git a/qa/python_models/model_init_del/config.pbtxt b/qa/python_models/model_init_del/config.pbtxt new file mode 100644 index 0000000000..be66468a0a --- /dev/null +++ b/qa/python_models/model_init_del/config.pbtxt @@ -0,0 +1,52 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
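A worked trace of the iterative_sequence model above for an input value of 3, following its docstring: the model counts down, releasing the request with the RESCHEDULE flag after each intermediate response and sending the final response with the FINAL flag before the request is released with ALL:

remaining = 3
emitted = []
while True:
    remaining -= 1
    emitted.append(remaining)   # each value is sent as one OUT response
    if remaining <= 0:
        break                   # this last response carries TRITONSERVER_RESPONSE_COMPLETE_FINAL
print(emitted)  # [2, 1, 0]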
+ +name: "model_init_del" +backend: "python" +max_batch_size: 0 + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +instance_group [ + { + count: 1 + kind: KIND_CPU + } +] # end instance_group diff --git a/qa/python_models/model_init_del/model.py b/qa/python_models/model_init_del/model.py new file mode 100644 index 0000000000..578279f8ef --- /dev/null +++ b/qa/python_models/model_init_del/model.py @@ -0,0 +1,57 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import os +import sys +import time + +import triton_python_backend_utils as pb_utils + +sys.path.append(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))) +from util import get_delay, inc_count + + +class TritonPythonModel: + def initialize(self, args): + inc_count("initialize") + self._sleep("initialize") + + def execute(self, requests): + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + out_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy()) + responses.append(pb_utils.InferenceResponse([out_tensor])) + self._sleep("infer") + return responses + + def finalize(self): + inc_count("finalize") + + def _sleep(self, kind): + delay = get_delay(kind) + if delay > 0: + time.sleep(delay) diff --git a/qa/python_models/model_init_del/util.py b/qa/python_models/model_init_del/util.py new file mode 100755 index 0000000000..a36f13eea9 --- /dev/null +++ b/qa/python_models/model_init_del/util.py @@ -0,0 +1,189 @@ +#!/usr/bin/env python3 + +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import fcntl +import os + +_model_name = "model_init_del" + +# +# Helper functions for reading/writing state to disk +# + + +def _get_number(filename): + full_path = os.path.join(os.environ["MODEL_LOG_DIR"], filename) + try: + with open(full_path, mode="r", encoding="utf-8", errors="strict") as f: + fcntl.lockf(f, fcntl.LOCK_SH) + txt = f.read() + except FileNotFoundError: + txt = "0" + return int(txt) + + +def _store_number(filename, number): + full_path = os.path.join(os.environ["MODEL_LOG_DIR"], filename) + txt = str(number) + with open(full_path, mode="w", encoding="utf-8", errors="strict") as f: + fcntl.lockf(f, fcntl.LOCK_EX) + f.write(txt) + + +def _inc_number(filename): + full_path = os.path.join(os.environ["MODEL_LOG_DIR"], filename) + try: + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + fcntl.lockf(f, fcntl.LOCK_EX) + txt = f.read() + number = int(txt) + 1 + txt = str(number) + f.truncate(0) + f.seek(0) + f.write(txt) + except FileNotFoundError: + number = 1 + _store_number(filename, number) + return number + + +# +# Functions for communicating initialize and finalize count between the model +# and test +# + + +def _get_count_filename(kind): + if kind != "initialize" and kind != "finalize": + raise KeyError("Invalid count kind: " + str(kind)) + filename = _model_name + "_" + kind + "_count.txt" + return filename + + +def get_count(kind): + return _get_number(_get_count_filename(kind)) + + +def inc_count(kind): + return _inc_number(_get_count_filename(kind)) + + +def reset_count(kind): + count = 0 + _store_number(_get_count_filename(kind), count) + return count + + +# +# Functions for communicating varies of delay (in seconds) to the model +# + + +def _get_delay_filename(kind): + if kind != "initialize" and kind != "infer": + raise KeyError("Invalid delay kind: " + str(kind)) + filename = _model_name + "_" + kind + "_delay.txt" + return filename + + +def get_delay(kind): + return _get_number(_get_delay_filename(kind)) + + +def set_delay(kind, delay): + _store_number(_get_delay_filename(kind), delay) + return delay + + +# +# Functions for modifying the model +# + + +def update_instance_group(instance_group_str): + full_path = os.path.join(os.path.dirname(__file__), "config.pbtxt") + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + txt = f.read() + txt, 
post_match = txt.split("instance_group [") + txt += "instance_group [\n" + txt += instance_group_str + txt += "\n] # end instance_group\n" + txt += post_match.split("\n] # end instance_group\n")[1] + f.truncate(0) + f.seek(0) + f.write(txt) + return txt + + +def update_sequence_batching(sequence_batching_str): + full_path = os.path.join(os.path.dirname(__file__), "config.pbtxt") + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + txt = f.read() + if "sequence_batching {" in txt: + txt, post_match = txt.split("sequence_batching {") + if sequence_batching_str != "": + txt += "sequence_batching {\n" + txt += sequence_batching_str + txt += "\n} # end sequence_batching\n" + txt += post_match.split("\n} # end sequence_batching\n")[1] + elif sequence_batching_str != "": + txt += "\nsequence_batching {\n" + txt += sequence_batching_str + txt += "\n} # end sequence_batching\n" + f.truncate(0) + f.seek(0) + f.write(txt) + return txt + + +def update_model_file(): + full_path = os.path.join(os.path.dirname(__file__), "1", "model.py") + with open(full_path, mode="a", encoding="utf-8", errors="strict") as f: + f.write("\n# dummy model file update\n") + + +def enable_batching(): + full_path = os.path.join(os.path.dirname(__file__), "config.pbtxt") + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + txt = f.read() + txt = txt.replace("max_batch_size: 0", "max_batch_size: 2") + f.truncate(0) + f.seek(0) + f.write(txt) + return txt + + +def disable_batching(): + full_path = os.path.join(os.path.dirname(__file__), "config.pbtxt") + with open(full_path, mode="r+", encoding="utf-8", errors="strict") as f: + txt = f.read() + txt = txt.replace("max_batch_size: 2", "max_batch_size: 0") + f.truncate(0) + f.seek(0) + f.write(txt) + return txt diff --git a/qa/python_models/multi_file/file1.py b/qa/python_models/multi_file/file1.py old mode 100644 new mode 100755 index 3e6706ade9..46b6d76934 --- a/qa/python_models/multi_file/file1.py +++ b/qa/python_models/multi_file/file1.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,4 +26,4 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -FILE_NAME = 'FILE1' +FILE_NAME = "FILE1" diff --git a/qa/python_models/multi_file/file2.py b/qa/python_models/multi_file/file2.py old mode 100644 new mode 100755 index 2b73ab0e3d..b7174da748 --- a/qa/python_models/multi_file/file2.py +++ b/qa/python_models/multi_file/file2.py @@ -1,4 +1,6 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +#!/usr/bin/env python3 + +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,4 +26,4 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
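A hypothetical test-side usage of the model_init_del util.py helpers above; the import path is illustrative (the module must be on sys.path), and MODEL_LOG_DIR must point at a directory shared with the model instances:

import os
os.environ["MODEL_LOG_DIR"] = "/tmp"  # assumed location visible to the model processes

import util  # hypothetical: qa/python_models/model_init_del/util.py placed on sys.path

util.reset_count("initialize")       # start the test from a clean counter
util.set_delay("infer", 2)           # make each execute() sleep 2 seconds
print(util.get_count("initialize"))  # 0 until a model instance has initialized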
-FILE_NAME = 'FILE2' +FILE_NAME = "FILE2" diff --git a/qa/python_models/multi_file/model.py b/qa/python_models/multi_file/model.py index a5a55002aa..b94d6f336f 100644 --- a/qa/python_models/multi_file/model.py +++ b/qa/python_models/multi_file/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,16 +25,15 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. import file1 -from . import file2 - import triton_python_backend_utils as pb_utils +from . import file2 -class TritonPythonModel: +class TritonPythonModel: def initialize(self, args): - if file1.FILE_NAME != 'FILE1' or file2.FILE_NAME != 'FILE2': - raise pb_utils.TritonModelException('Imports do not work') + if file1.FILE_NAME != "FILE1" or file2.FILE_NAME != "FILE2": + raise pb_utils.TritonModelException("Imports do not work") def execute(self, requests): pass diff --git a/qa/python_models/non_contiguous/model.py b/qa/python_models/non_contiguous/model.py index c8cb4b5570..de7417303b 100644 --- a/qa/python_models/non_contiguous/model.py +++ b/qa/python_models/non_contiguous/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -29,7 +29,6 @@ class TritonPythonModel: - def execute(self, requests): responses = [] new_shape = [10, 2, 6, 5, 11] @@ -40,8 +39,8 @@ def execute(self, requests): output0 = pb_utils.Tensor("OUTPUT0", input_numpy.reshape(new_shape)) # Transpose the tensor to create a non-contiguous tensor. output1 = pb_utils.Tensor("OUTPUT1", input_numpy.T) - output2 = pb_utils.Tensor("OUTPUT2", - np.transpose(input_numpy, shape_reorder)) - responses.append( - pb_utils.InferenceResponse([output0, output1, output2])) + output2 = pb_utils.Tensor( + "OUTPUT2", np.transpose(input_numpy, shape_reorder) + ) + responses.append(pb_utils.InferenceResponse([output0, output1, output2])) return responses diff --git a/qa/python_models/optional/config.pbtxt b/qa/python_models/optional/config.pbtxt index a496e48291..c681ec807f 100644 --- a/qa/python_models/optional/config.pbtxt +++ b/qa/python_models/optional/config.pbtxt @@ -53,10 +53,3 @@ output [ dims: [ 1 ] } ] - -instance_group [ - { - count: 1 - kind : KIND_CPU - } -] diff --git a/qa/python_models/optional/model.py b/qa/python_models/optional/model.py index 8e22d3b492..f0a790b43a 100644 --- a/qa/python_models/optional/model.py +++ b/qa/python_models/optional/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,12 +24,11 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import triton_python_backend_utils as pb_utils import numpy as np +import triton_python_backend_utils as pb_utils class TritonPythonModel: - def execute(self, requests): """Model supporting optional inputs. 
If the input is not provided, an input tensor of size 1 containing scalar 5 will be used.""" @@ -48,11 +47,10 @@ def execute(self, requests): else: input1_numpy = np.array([5], dtype=np.int32) - output0_tensor = pb_utils.Tensor("OUTPUT0", - input0_numpy + input1_numpy) - output1_tensor = pb_utils.Tensor("OUTPUT1", - input0_numpy - input1_numpy) + output0_tensor = pb_utils.Tensor("OUTPUT0", input0_numpy + input1_numpy) + output1_tensor = pb_utils.Tensor("OUTPUT1", input0_numpy - input1_numpy) responses.append( - pb_utils.InferenceResponse([output0_tensor, output1_tensor])) + pb_utils.InferenceResponse([output0_tensor, output1_tensor]) + ) return responses diff --git a/qa/python_models/python_based_backends/add_sub_backend/model.py b/qa/python_models/python_based_backends/add_sub_backend/model.py new file mode 100644 index 0000000000..7c9736b2d5 --- /dev/null +++ b/qa/python_models/python_based_backends/add_sub_backend/model.py @@ -0,0 +1,162 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json +import os + +import triton_python_backend_utils as pb_utils + +_ADD_SUB_ARGS_FILENAME = "model.json" + + +class TritonPythonModel: + @staticmethod + def auto_complete_config(auto_complete_model_config): + """This function is called only once when loading the model assuming + the server was not started with `--disable-auto-complete-config`. + + Parameters + ---------- + auto_complete_model_config : pb_utils.ModelConfig + An object containing the existing model configuration. 
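A condensed sketch of the optional-input handling in the model above, assuming pb_utils.get_input_tensor_by_name() returns None when an optional input is omitted from the request (backend context only; `request` is one element of the `requests` list passed to execute()):

import numpy as np
import triton_python_backend_utils as pb_utils  # importable only inside the Python backend

in1 = pb_utils.get_input_tensor_by_name(request, "INPUT1")
input1_numpy = in1.as_numpy() if in1 is not None else np.array([5], dtype=np.int32)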
+ + Returns + ------- + pb_utils.ModelConfig + An object containing the auto-completed model configuration + """ + inputs = [ + {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [4]}, + {"name": "INPUT1", "data_type": "TYPE_FP32", "dims": [4]}, + ] + outputs = [{"name": "OUTPUT", "data_type": "TYPE_FP32", "dims": [4]}] + + config = auto_complete_model_config.as_dict() + input_names = [] + output_names = [] + + for input in config["input"]: + input_names.append(input["name"]) + + for output in config["output"]: + output_names.append(output["name"]) + + for input in inputs: + if input["name"] not in input_names: + auto_complete_model_config.add_input(input) + + for output in outputs: + if output["name"] not in output_names: + auto_complete_model_config.add_output(output) + + return auto_complete_model_config + + def initialize(self, args): + """This function allows the model to initialize any state associated with this model. + + Parameters + ---------- + args : dict + Both keys and values are strings. The dictionary keys and values are: + * model_config: A JSON string containing the model configuration + * model_instance_kind: A string containing model instance kind + * model_instance_device_id: A string containing model instance device ID + * model_repository: Model repository path + * model_version: Model version + * model_name: Model name + """ + + self.model_config = model_config = json.loads(args["model_config"]) + + # Get OUTPUT configuration + output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT") + + engine_args_filepath = os.path.join( + pb_utils.get_model_dir(), _ADD_SUB_ARGS_FILENAME + ) + assert os.path.isfile( + engine_args_filepath + ), f"'{_ADD_SUB_ARGS_FILENAME}' containing add sub model args must be provided in '{pb_utils.get_model_dir()}'" + + with open(engine_args_filepath) as file: + self.add_sub_config = json.load(file) + + assert ( + "operation" in self.add_sub_config + ), f"Missing required key 'operation' in {_ADD_SUB_ARGS_FILENAME}" + + extra_keys = set(self.add_sub_config.keys()) - {"operation"} + assert ( + not extra_keys + ), f"Unsupported keys are provided in {_ADD_SUB_ARGS_FILENAME}: {', '.join(extra_keys)}" + + assert self.add_sub_config["operation"] in [ + "add", + "sub", + ], f"'operation' value must be 'add' or 'sub' in {_ADD_SUB_ARGS_FILENAME}" + + # Convert Triton types to numpy types + self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + + def execute(self, requests): + """This function is called when an inference request is made + for this model. + + Parameters + ---------- + requests : list + A list of pb_utils.InferenceRequest + + Returns + ------- + list + A list of pb_utils.InferenceResponse. The length of this list must + be the same as `requests` + """ + + responses = [] + + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + + if self.add_sub_config["operation"] == "add": + out = in_0.as_numpy() + in_1.as_numpy() + else: + out = in_0.as_numpy() - in_1.as_numpy() + + # Create output tensors. + out_tensor = pb_utils.Tensor("OUTPUT", out.astype(self.output_dtype)) + + # Create InferenceResponse. 
+ inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor]) + responses.append(inference_response) + + return responses + + def finalize(self): + """`finalize` is called only once when the model is being unloaded.""" + print("Cleaning up...") diff --git a/qa/python_models/python_version/model.py b/qa/python_models/python_version/model.py index ee358ffc55..5d77906fa9 100644 --- a/qa/python_models/python_version/model.py +++ b/qa/python_models/python_version/model.py @@ -1,4 +1,4 @@ -# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,18 +24,19 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import sys +import locale import os +import sys + +import numpy as np import triton_python_backend_utils as pb_utils class TritonPythonModel: - @staticmethod def auto_complete_config(auto_complete_model_config): - input = {'name': 'INPUT', 'data_type': 'TYPE_FP32', 'dims': [1]} - output = {'name': 'OUTPUT', 'data_type': 'TYPE_FP32', 'dims': [1]} + input = {"name": "INPUT", "data_type": "TYPE_FP32", "dims": [1]} + output = {"name": "OUTPUT", "data_type": "TYPE_FP32", "dims": [1]} auto_complete_model_config.set_max_batch_size(0) auto_complete_model_config.add_input(input) @@ -45,19 +46,21 @@ def auto_complete_config(auto_complete_model_config): def initialize(self, args): import tensorflow - self.model_config = args['model_config'] + + self.model_config = args["model_config"] # This is to make sure that /bin/bash is not picking up # the wrong shared libraries after installing Tensorflow. # Tensorflow uses a shared library which is common with # bash. - os.system('/bin/bash --help') + os.system("/bin/bash --help") print( - f'Python version is {sys.version_info.major}.{sys.version_info.minor}, NumPy version is {np.version.version}, and Tensorflow version is {tensorflow.__version__}', - flush=True) + f"Python version is {sys.version_info.major}.{sys.version_info.minor}, NumPy version is {np.version.version}, and Tensorflow version is {tensorflow.__version__}", + flush=True, + ) + print(f"Locale is {locale.getlocale()}", flush=True) def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" responses = [] for request in requests: input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") diff --git a/qa/python_models/pytorch_fp32_fp32/model.py b/qa/python_models/pytorch_fp32_fp32/model.py index 4f11d3c726..98269213b2 100644 --- a/qa/python_models/pytorch_fp32_fp32/model.py +++ b/qa/python_models/pytorch_fp32_fp32/model.py @@ -1,4 +1,4 @@ -# Copyright 2020-2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -25,16 +25,13 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
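The add_sub python-based backend added earlier in this patch reads a small model.json from the directory returned by pb_utils.get_model_dir() and only honors a single "operation" key whose value must be "add" or "sub". A minimal sketch of producing that file is below; the repository path and the choice of the version directory are assumptions, not something this patch pins down.

    # Illustrative only: write the model.json consumed by the add_sub backend's
    # initialize() above. The single supported key is "operation" ("add" or "sub").
    import json
    import os

    model_dir = "models/add_sub_example/1"  # hypothetical model repository layout
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump({"operation": "add"}, f)

Any other key in the file trips the assertion in initialize(), so the file should contain nothing but "operation".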
import numpy as np - import torch import torch.nn as nn import torch.nn.functional as F - import triton_python_backend_utils as pb_utils class Net(nn.Module): - def __init__(self): super(Net, self).__init__() self.conv1 = nn.Conv2d(1, 32, 3, 1) @@ -61,7 +58,6 @@ def forward(self, x): class TritonPythonModel: - def initialize(self, args): torch.manual_seed(0) self.model = Net() diff --git a/qa/python_models/request_rescheduling_addsub/config.pbtxt b/qa/python_models/request_rescheduling_addsub/config.pbtxt new file mode 100644 index 0000000000..7667bfb3c0 --- /dev/null +++ b/qa/python_models/request_rescheduling_addsub/config.pbtxt @@ -0,0 +1,61 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "request_rescheduling_addsub" +backend: "python" + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +input [ + { + name: "INPUT1" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +output [ + { + name: "OUTPUT1" + data_type: TYPE_FP32 + dims: [ 16 ] + } +] +sequence_batching { + iterative_sequence : true +} +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/request_rescheduling_addsub/model.py b/qa/python_models/request_rescheduling_addsub/model.py new file mode 100644 index 0000000000..fb7b0ac9c7 --- /dev/null +++ b/qa/python_models/request_rescheduling_addsub/model.py @@ -0,0 +1,82 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. 
+# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") + + self.output0_dtype = pb_utils.triton_string_to_numpy( + output0_config["data_type"] + ) + self.output1_dtype = pb_utils.triton_string_to_numpy( + output1_config["data_type"] + ) + + self.idx = 0 + + def execute(self, requests): + """This function is called on inference request.""" + + output0_dtype = self.output0_dtype + output1_dtype = self.output1_dtype + + responses = [] + + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") + + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) + + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + + inference_response = pb_utils.InferenceResponse( + output_tensors=[out_tensor_0, out_tensor_1] + ) + + # Explicitly reschedule the first request + if self.idx == 0: + request.set_release_flags( + pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE + ) + responses.append(None) + self.idx += 1 + else: + responses.append(inference_response) + + return responses diff --git a/qa/python_models/response_sender_error/model.py b/qa/python_models/response_sender_error/model.py index eef186e9d4..4f1e0e5e85 100644 --- a/qa/python_models/response_sender_error/model.py +++ b/qa/python_models/response_sender_error/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,32 +24,32 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np import json + import triton_python_backend_utils as pb_utils class TritonPythonModel: - """ This model tries to create a response sender in + """This model tries to create a response sender in a model that is not configured with decoupled model transaction policy. 
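The request_rescheduling_addsub model just above exercises request rescheduling: the first request it receives is released back to Triton with TRITONSERVER_REQUEST_RELEASE_RESCHEDULE and a None placeholder keeps the responses list aligned with requests; when the same request is delivered again, the normal response is produced. A distilled sketch of that pattern follows (a config with FP32 INPUT0/OUTPUT0 and iterative sequence batching is assumed; this is not the model above verbatim).

    # Reschedule-once pattern: the first delivery gets no response yet, only a
    # RESCHEDULE release flag and a None slot; redelivered and later requests
    # get ordinary responses.
    import numpy as np
    import triton_python_backend_utils as pb_utils

    class TritonPythonModel:
        def initialize(self, args):
            self.first_delivery = True

        def execute(self, requests):
            responses = []
            for request in requests:
                if self.first_delivery:
                    self.first_delivery = False
                    request.set_release_flags(
                        pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE
                    )
                    responses.append(None)  # rescheduled requests must not respond yet
                else:
                    in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
                    out = pb_utils.Tensor("OUTPUT0", in_0.as_numpy().astype(np.float32))
                    responses.append(pb_utils.InferenceResponse([out]))
            return responses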
""" def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ Tries to create a response sender object and use that + """Tries to create a response sender object and use that for sending the response. """ @@ -60,15 +60,16 @@ def execute(self, requests): response_sender = request.get_response_sender() in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - out_0, out_1 = (in_0.as_numpy() + in_1.as_numpy(), - in_0.as_numpy() - in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() + in_1.as_numpy(), + in_0.as_numpy() - in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) response_sender.send( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + pb_utils.InferenceResponse([out_tensor_0, out_tensor_1]) + ) response_sender.close() return None diff --git a/qa/python_models/sequence_int32/config.pbtxt b/qa/python_models/sequence_int32/config.pbtxt new file mode 100644 index 0000000000..fb9236b347 --- /dev/null +++ b/qa/python_models/sequence_int32/config.pbtxt @@ -0,0 +1,80 @@ +# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "python_nobatch_sequence_int32" +backend: "python" +max_batch_size: 0 +version_policy: { latest { num_versions: 1 }} + + +instance_group [ + { + kind: KIND_GPU +count: 4 + } +] + + +input [ + { + name: "INPUT" + data_type: TYPE_INT32 + dims: [ 1 ] + + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_INT32 + dims: [ 1 ] + + + } +] +sequence_batching { + max_sequence_idle_microseconds: 5000000 + control_input [ + { + name: "START" + control [ + { + kind: CONTROL_SEQUENCE_START + int32_false_true: [ 0, 1 ] + } + ] + }, + { + name: "READY" + control [ + { + kind: CONTROL_SEQUENCE_READY + int32_false_true: [ 0, 1 ] + } + ] + } + ] +} diff --git a/qa/python_models/sequence_int32/model.py b/qa/python_models/sequence_int32/model.py new file mode 100644 index 0000000000..445cb5b13e --- /dev/null +++ b/qa/python_models/sequence_int32/model.py @@ -0,0 +1,92 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT") + + self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"]) + + self.accumulator = np.zeros(1) + self.max_batch_size = model_config["max_batch_size"] + + def execute(self, requests): + """ + This function is called on inference request. 
+ It is derived from "create_tf_modelfile" in + common/gen_qa_sequence_models.py and maintains + a true accumulator when the max batch size is 0 + + """ + output_dtype = self.output_dtype + + responses = [] + for request in requests: + input_tensor = ( + pb_utils.get_input_tensor_by_name(request, "INPUT") + .as_numpy() + .astype(np.int32) + ) + start_tensor = ( + pb_utils.get_input_tensor_by_name(request, "START") + .as_numpy() + .astype(np.int32) + ) + ready_tensor = ( + pb_utils.get_input_tensor_by_name(request, "READY") + .as_numpy() + .astype(np.int32) + ) + + if self.max_batch_size == 0: + tmp = np.where( + np.equal(start_tensor, 1), + input_tensor, + np.add(self.accumulator, input_tensor), + ) + newacc = np.where(np.equal(ready_tensor, 1), tmp, self.accumulator) + self.accumulator = newacc + out_tensor = pb_utils.Tensor( + "OUTPUT", self.accumulator.astype(output_dtype) + ) + else: + tmp = np.where( + np.equal(ready_tensor, 1), + np.add(start_tensor, input_tensor), + np.zeros(np.shape(input_tensor), dtype=output_dtype), + ) + out_tensor = pb_utils.Tensor("OUTPUT", tmp.astype(output_dtype)) + + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-vs.yaml b/qa/python_models/sequence_py/config.pbtxt similarity index 76% rename from deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-vs.yaml rename to qa/python_models/sequence_py/config.pbtxt index 32e65836a7..b58796058d 100644 --- a/deploy/gke-marketplace-app/server-deployer/chart/triton/templates/istio-vs.yaml +++ b/qa/python_models/sequence_py/config.pbtxt @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,22 +24,30 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -apiVersion: networking.istio.io/v1alpha3 -kind: VirtualService -metadata: - name: triton-vs -spec: - hosts: - - "*" - gateways: - - triton-gateway - http: - - route: - - destination: - host: {{ template "triton-inference-server.name" . }} - port: - {{ if eq .Values.tritonProtocol "gRPC" }} - number: 8001 - {{ else }} - number: 8000 - {{ end }} +backend: "python" +max_batch_size: 4 + +input [ + { + name: "INPUT0" + data_type: TYPE_INT32 + dims: [ 1 ] + + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_INT32 + dims: [ 1 ] + } +] + +sequence_batching { + oldest { + max_candidate_sequences: 4 + max_queue_delay_microseconds: 1000000 + preserve_ordering: False + } + max_sequence_idle_microseconds: 10000000 +} diff --git a/qa/python_models/sequence_py/model.py b/qa/python_models/sequence_py/model.py new file mode 100644 index 0000000000..b375af3e30 --- /dev/null +++ b/qa/python_models/sequence_py/model.py @@ -0,0 +1,93 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
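The sequence_int32 model above keeps a running accumulator per model instance and relies on the START and READY control tensors injected by the sequence batcher. A hedged client sketch is below; the model name from the config above, a local gRPC endpoint, and the tritonclient package are assumptions about how the test environment serves it.

    # Illustrative client: drive one sequence through the accumulator model.
    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient("localhost:8001")
    values = [2, 3, 4]
    total = 0
    for i, value in enumerate(values):
        inp = grpcclient.InferInput("INPUT", [1], "INT32")
        inp.set_data_from_numpy(np.array([value], dtype=np.int32))
        result = client.infer(
            "python_nobatch_sequence_int32",
            inputs=[inp],
            sequence_id=1001,
            sequence_start=(i == 0),
            sequence_end=(i == len(values) - 1),
        )
        total = value if i == 0 else total + value
        # With max_batch_size 0 the model returns the running sum so far.
        assert result.as_numpy("OUTPUT")[0] == total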
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import numpy as np +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = json.loads(args["model_config"]) + self.sequences = {} + self.decoupled = self.model_config.get("model_transaction_policy", {}).get( + "decoupled" + ) + + def get_next_sequence_output_tensor(self, request): + sid = request.correlation_id() + flags = request.flags() + if flags == pb_utils.TRITONSERVER_REQUEST_FLAG_SEQUENCE_START: + if sid in self.sequences: + raise pb_utils.TritonModelException( + "Can't start a new sequence with existing ID" + ) + self.sequences[sid] = [1] + else: + if sid not in self.sequences: + raise pb_utils.TritonModelException( + "Need START flag for a sequence ID that doesn't already exist." + ) + + last = self.sequences[sid][-1] + self.sequences[sid].append(last + 1) + + output = self.sequences[sid][-1] + output = np.array([output]) + out_tensor = pb_utils.Tensor("OUTPUT0", output.astype(np.int32)) + return out_tensor + + def execute(self, requests): + if self.decoupled: + return self.execute_decoupled(requests) + else: + return self.execute_non_decoupled(requests) + + def execute_non_decoupled(self, requests): + responses = [] + for request in requests: + output_tensor = self.get_next_sequence_output_tensor(request) + response = pb_utils.InferenceResponse([output_tensor]) + responses.append(response) + return responses + + def execute_decoupled(self, requests): + for request in requests: + sender = request.get_response_sender() + output_tensor = self.get_next_sequence_output_tensor(request) + + # Send 3 responses per request + for _ in range(3): + response = pb_utils.InferenceResponse([output_tensor]) + sender.send(response) + + sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL) + + return None + + def finalize(self): + print(f"Cleaning up. Final sequences stored: {self.sequences}") diff --git a/qa/python_models/string/model.py b/qa/python_models/string/model.py index 1fd5aece6e..5e419d965a 100644 --- a/qa/python_models/string/model.py +++ b/qa/python_models/string/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
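The sequence_py model above tracks a per-correlation-ID counter and, when the model transaction policy is decoupled, pushes three responses per request through the response sender before signalling TRITONSERVER_RESPONSE_COMPLETE_FINAL. A hedged sketch of the matching client side is below; it assumes the model is served as "sequence_py" with the decoupled policy enabled and uses the gRPC streaming client, which is the usual way to consume decoupled responses.

    # Illustrative decoupled client: one request in, three streamed responses out.
    import queue
    import numpy as np
    import tritonclient.grpc as grpcclient

    results = queue.Queue()
    client = grpcclient.InferenceServerClient("localhost:8001")
    client.start_stream(callback=lambda result, error: results.put((result, error)))

    inp = grpcclient.InferInput("INPUT0", [1, 1], "INT32")
    inp.set_data_from_numpy(np.array([[1]], dtype=np.int32))
    client.async_stream_infer(
        "sequence_py", inputs=[inp], sequence_id=7, sequence_start=True
    )

    for _ in range(3):  # the model sends three responses for this request
        result, error = results.get()
        assert error is None
        print(result.as_numpy("OUTPUT0"))
    client.stop_stream()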
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -35,15 +35,15 @@ class TritonPythonModel: def initialize(self, args): self._index = 0 - self._dtypes = [np.bytes_, np.object_, np.object] + self._dtypes = [np.bytes_, np.object_] def execute(self, requests): responses = [] for request in requests: in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", - in_0.as_numpy().astype(self._dtypes[self._index])) + "OUTPUT0", in_0.as_numpy().astype(self._dtypes[self._index]) + ) self._index += 1 responses.append(pb_utils.InferenceResponse([out_tensor_0])) return responses diff --git a/qa/python_models/string_fixed/model.py b/qa/python_models/string_fixed/model.py index d1aed94be3..d6e23eccb8 100644 --- a/qa/python_models/string_fixed/model.py +++ b/qa/python_models/string_fixed/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -35,21 +35,29 @@ class TritonPythonModel: def initialize(self, args): self._index = 0 - self._dtypes = [np.bytes_, np.object_, np.object] + self._dtypes = [np.bytes_, np.object_] def execute(self, requests): + # Create four different responses (empty string or fixed string) * (two + # datatypes) responses = [] for _ in requests: - if self._index % 2 == 0: + if self._index == 0: out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", - np.array(['123456'], dtype=self._dtypes[self._index % 3])) - else: - # Test sending strings with no elements + "OUTPUT0", np.array(["123456"], dtype=self._dtypes[0]) + ) + elif self._index == 1: out_tensor_0 = pb_utils.Tensor( - "OUTPUT0", np.array([], - dtype=self._dtypes[self._index % 3])) - + "OUTPUT0", np.array([], dtype=self._dtypes[1]) + ) + elif self._index == 2: + out_tensor_0 = pb_utils.Tensor( + "OUTPUT0", np.array(["123456"], dtype=self._dtypes[0]) + ) + elif self._index == 3: + out_tensor_0 = pb_utils.Tensor( + "OUTPUT0", np.array([], dtype=self._dtypes[1]) + ) self._index += 1 responses.append(pb_utils.InferenceResponse([out_tensor_0])) return responses diff --git a/qa/python_models/string_identity/model.py b/qa/python_models/string_identity/model.py index 39575c119b..0288b129bc 100644 --- a/qa/python_models/string_identity/model.py +++ b/qa/python_models/string_identity/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,23 +24,21 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import sys import json +import sys -sys.path.append('../../') +sys.path.append("../../") import triton_python_backend_utils as pb_utils class TritonPythonModel: - """This model always returns the input that it has received. - """ + """This model always returns the input that it has received.""" def initialize(self, args): - self.model_config = json.loads(args['model_config']) + self.model_config = json.loads(args["model_config"]) def execute(self, requests): - """ This function is called on inference request. 
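The dtype lists in the string models above shrink from three entries to two because np.object was only an alias for the builtin object type; NumPy deprecated the alias in 1.20 and removed it in 1.24, so np.object_ (together with np.bytes_) is the spelling that keeps working. A small illustration:

    # np.object_ is the supported dtype for arrays of Python strings; the old
    # np.object alias raises AttributeError on NumPy >= 1.24.
    import numpy as np

    strings = np.array(["123456"], dtype=np.object_)  # works on all NumPy versions
    raw = np.array([b"123456"], dtype=np.bytes_)      # fixed-width byte strings
    print(strings.dtype, raw.dtype)                   # object |S6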
- """ + """This function is called on inference request.""" responses = [] for request in requests: diff --git a/qa/python_models/sub_add/model.py b/qa/python_models/sub_add/model.py index 0a53874629..8ac679c86f 100644 --- a/qa/python_models/sub_add/model.py +++ b/qa/python_models/sub_add/model.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -24,32 +24,31 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -import numpy as np -import sys import json +import sys -sys.path.append('../../') +import numpy as np + +sys.path.append("../../") import triton_python_backend_utils as pb_utils class TritonPythonModel: - def initialize(self, args): - self.model_config = model_config = json.loads(args['model_config']) + self.model_config = model_config = json.loads(args["model_config"]) - output0_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT0") - output1_config = pb_utils.get_output_config_by_name( - model_config, "OUTPUT1") + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + output1_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT1") self.output0_dtype = pb_utils.triton_string_to_numpy( - output0_config['data_type']) + output0_config["data_type"] + ) self.output1_dtype = pb_utils.triton_string_to_numpy( - output1_config['data_type']) + output1_config["data_type"] + ) def execute(self, requests): - """ This function is called on inference request. - """ + """This function is called on inference request.""" output0_dtype = self.output0_dtype output1_dtype = self.output1_dtype @@ -59,18 +58,21 @@ def execute(self, requests): input_tensors = request.inputs() in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") in_1 = pb_utils.get_input_tensor_by_name(request, "INPUT1") - if in_0.as_numpy().dtype.type is np.bytes_ or in_0.as_numpy( - ).dtype == np.object_: - out_0, out_1 = (in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32),\ - in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32)) + if ( + in_0.as_numpy().dtype.type is np.bytes_ + or in_0.as_numpy().dtype == np.object_ + ): + out_0, out_1 = ( + in_0.as_numpy().astype(np.int32) - in_1.as_numpy().astype(np.int32), + in_0.as_numpy().astype(np.int32) + in_1.as_numpy().astype(np.int32), + ) else: - out_0, out_1 = (in_0.as_numpy() - in_1.as_numpy(), - in_0.as_numpy() + in_1.as_numpy()) + out_0, out_1 = ( + in_0.as_numpy() - in_1.as_numpy(), + in_0.as_numpy() + in_1.as_numpy(), + ) - out_tensor_0 = pb_utils.Tensor("OUTPUT0", - out_0.astype(output0_dtype)) - out_tensor_1 = pb_utils.Tensor("OUTPUT1", - out_1.astype(output1_dtype)) - responses.append( - pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + out_tensor_1 = pb_utils.Tensor("OUTPUT1", out_1.astype(output1_dtype)) + responses.append(pb_utils.InferenceResponse([out_tensor_0, out_tensor_1])) return responses diff --git a/qa/python_models/torchvision/resnet50/config.pbtxt b/qa/python_models/torchvision/resnet50/config.pbtxt new file mode 100644 index 0000000000..fdbc7c7de9 --- /dev/null +++ b/qa/python_models/torchvision/resnet50/config.pbtxt @@ -0,0 +1,40 @@ +# Copyright (c) 2023, 
NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "resnet50_python" +backend: "python" +max_batch_size: 128 +input { + name: "INPUT0" + data_type: TYPE_FP32 + format: FORMAT_NCHW + dims: [ 3, 224, 224 ] + } +output { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 1000 ] + } diff --git a/qa/python_models/torchvision/resnet50/model.py b/qa/python_models/torchvision/resnet50/model.py new file mode 100644 index 0000000000..1e2dbbf7a1 --- /dev/null +++ b/qa/python_models/torchvision/resnet50/model.py @@ -0,0 +1,62 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
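The resnet50_python config above declares a single FORMAT_NCHW FP32 input of shape [3, 224, 224] with max_batch_size 128. A hedged sketch of the client-side preprocessing such an input typically expects is below; the ImageNet mean/std values are the standard ones and the image path is purely illustrative.

    # Illustrative preprocessing: resize to 224x224, scale to [0, 1], normalize
    # with ImageNet statistics, then reorder HWC -> NCHW with a batch dimension.
    import numpy as np
    from PIL import Image

    img = Image.open("cat.jpg").convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = (x - mean) / std
    x = np.transpose(x, (2, 0, 1))[np.newaxis, ...]
    print(x.shape, x.dtype)  # (1, 3, 224, 224) float32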
+ +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import to_dlpack + + +class TritonPythonModel: + def initialize(self, args): + """ + This function initializes pre-trained ResNet50 model. + """ + self.device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu" + # Our tests currently depend on torchvision=0.14, + # to make sure `torch.hub` loads Resnet50 implementation + # compatible with torchvision=0.14, we need to provide tag + self.model = ( + torch.hub.load( + "pytorch/vision:v0.14.1", "resnet50", weights="IMAGENET1K_V2" + ) + .to(self.device) + .eval() + ) + + def execute(self, requests): + """ + This function receives a list of requests (`pb_utils.InferenceRequest`), + performs inference on every request and appends it to responses. + """ + responses = [] + for request in requests: + input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0") + result = self.model( + torch.as_tensor(input_tensor.as_numpy(), device=self.device) + ) + out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(result)) + responses.append(pb_utils.InferenceResponse([out_tensor])) + return responses diff --git a/qa/python_models/variable_gpu_output/config.pbtxt b/qa/python_models/variable_gpu_output/config.pbtxt new file mode 100644 index 0000000000..8fe69444f7 --- /dev/null +++ b/qa/python_models/variable_gpu_output/config.pbtxt @@ -0,0 +1,55 @@ +# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
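The resnet50 model above hands its GPU result to Triton without a host copy by pairing torch.utils.dlpack.to_dlpack with pb_utils.Tensor.from_dlpack. A short sketch isolating that hand-off is below; it only runs inside a Python backend model on a GPU instance, and the tensor name and shape are illustrative.

    # DLPack hand-off: wrap a CUDA torch tensor as a Triton output tensor
    # without copying it to host memory.
    import torch
    import triton_python_backend_utils as pb_utils
    from torch.utils.dlpack import to_dlpack

    result = torch.ones(1, 1000, device="cuda")  # stand-in for the model output
    out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT0", to_dlpack(result))

The reverse direction, viewing a GPU input tensor as a torch tensor, follows the same idea through the tensor's DLPack interface.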
+ +name: "variable_gpu_output" +backend: "python" +max_batch_size: 256 + +input [ + { + name: "INPUT" + data_type: TYPE_FP32 + dims: [ 1 ] + } +] +output [ + { + name: "OUTPUT" + data_type: TYPE_FP32 + dims: [ -1 ] + } +] + +dynamic_batching { + max_queue_delay_microseconds: 1000000 +} + +instance_group [ + { + count: 1 + kind: KIND_GPU + } +] diff --git a/qa/python_models/variable_gpu_output/model.py b/qa/python_models/variable_gpu_output/model.py new file mode 100644 index 0000000000..2da2a3cbd2 --- /dev/null +++ b/qa/python_models/variable_gpu_output/model.py @@ -0,0 +1,46 @@ +# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import torch +import triton_python_backend_utils as pb_utils +from torch.utils.dlpack import to_dlpack + + +class TritonPythonModel: + def execute(self, requests): + # The client will send 5 requests + assert len(requests) == 5 + responses = [] + for i, request in enumerate(requests): + # Create an (i+1)-element array with all the tensors equal to (i+1) + output = torch.ones(i + 1, dtype=torch.float32, device="cuda") + output = output * (i + 1) + output_pb_tensor = pb_utils.Tensor.from_dlpack("OUTPUT", to_dlpack(output)) + inference_response = pb_utils.InferenceResponse( + output_tensors=[output_pb_tensor] + ) + responses.append(inference_response) + return responses diff --git a/qa/python_models/wrong_model/model.py b/qa/python_models/wrong_model/model.py index 9059255395..2cac72324f 100644 --- a/qa/python_models/wrong_model/model.py +++ b/qa/python_models/wrong_model/model.py @@ -1,4 +1,4 @@ -# Copyright 2020-2021, NVIDIA CORPORATION. All rights reserved. +# Copyright 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
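The variable_gpu_output model above leans on dynamic batching (a one second max_queue_delay) to see all five client requests in a single execute() call, then answers each with a different-length OUTPUT, which is why the output dims are [-1]. A hedged client sketch is below; the server address is an assumption and the request count mirrors the assert in the model.

    # Illustrative client: launch five concurrent requests so dynamic batching
    # can group them into one execute() call on the model above.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient("localhost:8000")
    handles = []
    for i in range(5):
        inp = httpclient.InferInput("INPUT", [1, 1], "FP32")
        inp.set_data_from_numpy(np.array([[float(i)]], dtype=np.float32))
        handles.append(client.async_infer("variable_gpu_output", inputs=[inp]))

    for i, handle in enumerate(handles):
        out = handle.get_result().as_numpy("OUTPUT")
        print(f"request {i}: {out}")  # the batch position determines length and value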
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -28,7 +28,6 @@ class TritonPythonModel: - def execute(self, requests): """ This model ensures that errors in the execute function are properly diff --git a/qa/python_models/wrong_return_type/config.pbtxt b/qa/python_models/wrong_return_type/config.pbtxt new file mode 100644 index 0000000000..e34905e635 --- /dev/null +++ b/qa/python_models/wrong_return_type/config.pbtxt @@ -0,0 +1,49 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +name: "wrong_return_type" +backend: "python" + +input [ + { + name: "INPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] +output [ + { + name: "OUTPUT0" + data_type: TYPE_FP32 + dims: [ 4 ] + } +] + +sequence_batching { + iterative_sequence : true +} + +instance_group [{ kind: KIND_CPU }] diff --git a/qa/python_models/wrong_return_type/model.py b/qa/python_models/wrong_return_type/model.py new file mode 100644 index 0000000000..c5e6f660fc --- /dev/null +++ b/qa/python_models/wrong_return_type/model.py @@ -0,0 +1,67 @@ +# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json + +import triton_python_backend_utils as pb_utils + + +class TritonPythonModel: + def initialize(self, args): + self.model_config = model_config = json.loads(args["model_config"]) + + output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT0") + + self.output0_dtype = pb_utils.triton_string_to_numpy( + output0_config["data_type"] + ) + + def execute(self, requests): + output0_dtype = self.output0_dtype + + responses = [] + + for request in requests: + in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0") + + out_0 = in_0.as_numpy() + + # Create output tensors. You need pb_utils.Tensor + # objects to create pb_utils.InferenceResponse. + out_tensor_0 = pb_utils.Tensor("OUTPUT0", out_0.astype(output0_dtype)) + + inference_response = pb_utils.InferenceResponse( + output_tensors=[out_tensor_0] + ) + + request.set_release_flags(pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE) + # Should append `None` for rescheduled requests. + responses.append(inference_response) + + return responses + + def finalize(self): + pass diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt index d17392b869..f64894c5cd 100644 --- a/src/CMakeLists.txt +++ b/src/CMakeLists.txt @@ -1,4 +1,4 @@ -# Copyright 2019-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright 2019-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
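The wrong_return_type model above deliberately pairs a RESCHEDULE release flag with a real InferenceResponse instead of the required None placeholder, so the test expects the inference to fail rather than complete. A hedged client-side sketch of that expectation is below; the endpoint is an assumption and no particular error text is assumed.

    # Illustrative expectation: inference against the wrong_return_type model
    # should surface an error because of the flag/response mismatch.
    import numpy as np
    import tritonclient.grpc as grpcclient
    from tritonclient.utils import InferenceServerException

    client = grpcclient.InferenceServerClient("localhost:8001")
    inp = grpcclient.InferInput("INPUT0", [4], "FP32")
    inp.set_data_from_numpy(np.ones(4, dtype=np.float32))
    try:
        client.infer("wrong_return_type", inputs=[inp])
    except InferenceServerException as e:
        print("expected failure:", e)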
# # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions @@ -68,13 +68,6 @@ if(${TRITON_ENABLE_GPU}) message(STATUS "Using CUDA ${CUDA_VERSION}") endif() # TRITON_ENABLE_GPU -# GRPC -# -if(${TRITON_ENABLE_GRPC}) - find_package(gRPC CONFIG REQUIRED) - message(STATUS "Using gRPC ${gRPC_VERSION}") -endif() - # libevent # if(${TRITON_ENABLE_HTTP} OR ${TRITON_ENABLE_METRICS} OR @@ -83,6 +76,16 @@ if(${TRITON_ENABLE_HTTP} OR ${TRITON_ENABLE_METRICS} OR message(STATUS "Using libevent ${Libevent_VERSION}") endif() +# OpenTelemetry +# +if (NOT WIN32 AND ${TRITON_ENABLE_TRACING}) + find_package(absl CONFIG REQUIRED) + find_package(CURL CONFIG REQUIRED) + find_package(nlohmann_json CONFIG REQUIRED) + find_package(opentelemetry-cpp CONFIG REQUIRED) + message(STATUS "Using opentelemetry-cpp ${opentelemetry-cpp_VERSION}") +endif() + # re2 # find_library(RE2_LIBRARY NAMES re2) @@ -93,6 +96,7 @@ find_library(RE2_LIBRARY NAMES re2) add_executable( main classification.cc + command_line_parser.cc common.cc main.cc shared_memory_manager.cc @@ -121,8 +125,11 @@ if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC") target_compile_options( main PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) + target_compile_definitions(main + PRIVATE + NOMINMAX) else() target_compile_options( main @@ -185,18 +192,6 @@ if(${TRITON_ENABLE_HTTP} OR ${TRITON_ENABLE_METRICS} OR ) endif() -if(${TRITON_ENABLE_GRPC}) - target_include_directories( - main - PRIVATE - $ - ) - - target_compile_definitions( - main - PRIVATE TRITON_ENABLE_GRPC=1 - ) -endif() # TRITON_ENABLE_GRPC if(${TRITON_ENABLE_HTTP}) target_compile_definitions( @@ -245,6 +240,14 @@ if(${TRITON_ENABLE_TRACING}) main PRIVATE TRITON_ENABLE_TRACING=1 ) +# FIXME: remove, when Windows support is added for Opentelemetry + if (NOT WIN32) + target_include_directories( + main + PRIVATE + ${OPENTELEMETRY_CPP_INCLUDE_DIRS} + ) + endif() endif() # TRITON_ENABLE_TRACING if(${TRITON_ENABLE_NVTX}) @@ -278,116 +281,31 @@ else() ) endif() -# grpc endpoint -# if(${TRITON_ENABLE_GRPC}) - add_library( - grpc-endpoint-library EXCLUDE_FROM_ALL - grpc_server.cc - grpc_server.h - ) - - target_compile_features(grpc-endpoint-library PRIVATE cxx_std_11) - if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC") - target_compile_options( - grpc-endpoint-library - PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc - ) - else() - target_compile_options( - grpc-endpoint-library - PRIVATE - -Wall -Wextra -Wno-unused-parameter -Wno-deprecated-declarations -Werror - ) - endif() - - set_target_properties( - grpc-endpoint-library - PROPERTIES - POSITION_INDEPENDENT_CODE ON - ) + # + # GRPC + # + find_package(gRPC CONFIG REQUIRED) + message(STATUS "Using gRPC ${gRPC_VERSION}") + add_subdirectory(grpc) target_link_libraries( - grpc-endpoint-library - PUBLIC - proto-library # from repo-common - triton-common-logging # from repo-common - triton-common-json # from repo-common - grpc-service-library # from repo-common - triton-core-serverapi # from repo-core - triton-core-serverstub # from repo-core - gRPC::grpc++ - gRPC::grpc - protobuf::libprotobuf + main + PRIVATE + grpc-endpoint-library ) target_include_directories( - grpc-endpoint-library + main PRIVATE $ ) target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_GRPC=1 - ) - - if(${TRITON_ENABLE_GPU}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_GPU=1 - PRIVATE 
TRITON_MIN_COMPUTE_CAPABILITY=${TRITON_MIN_COMPUTE_CAPABILITY} - ) - - target_link_libraries( - grpc-endpoint-library - PUBLIC - CUDA::cudart - ) - endif() # TRITON_ENABLE_GPU - - if(${TRITON_ENABLE_METRICS}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_METRICS=1 - ) - endif() # TRITON_ENABLE_METRICS - - if(${TRITON_ENABLE_LOGGING}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_LOGGING=1 - ) - endif() # TRITON_ENABLE_LOGGING - - if(${TRITON_ENABLE_STATS}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_STATS=1 - ) - endif() # TRITON_ENABLE_STATS - - if(${TRITON_ENABLE_TRACING}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_TRACING=1 - ) - endif() # TRITON_ENABLE_TRACING - - if(${TRITON_ENABLE_NVTX}) - target_compile_definitions( - grpc-endpoint-library - PRIVATE TRITON_ENABLE_NVTX=1 - ) - endif() # TRITON_ENABLE_NVTX - -target_link_libraries( main - PRIVATE - grpc-endpoint-library + PRIVATE TRITON_ENABLE_GRPC=1 ) -endif() # TRITON_ENABLE_GRPC +endif() # http endpoint # @@ -440,7 +358,7 @@ if(${TRITON_ENABLE_HTTP} target_compile_options( http-endpoint-library PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) else() target_compile_options( @@ -473,6 +391,16 @@ if(${TRITON_ENABLE_HTTP} PRIVATE $ ) + # FIXME when Triton support of Opentelemetry is available on Windows + # add ${OPENTELEMETRY_CPP_INCLUDE_DIRS} to above target_include_directories + # JIRA DLIS-4786 + if (NOT WIN32 AND ${TRITON_ENABLE_TRACING}) + target_include_directories( + http-endpoint-library + PRIVATE ${OPENTELEMETRY_CPP_INCLUDE_DIRS} + ) + endif() + if(${TRITON_ENABLE_GPU}) target_compile_definitions( http-endpoint-library @@ -579,6 +507,20 @@ if(${TRITON_ENABLE_TRACING}) tracer.cc tracer.h ) + if (NOT WIN32) + target_compile_features(tracing-library PRIVATE cxx_std_17) + + target_include_directories( + tracing-library + PRIVATE ${OPENTELEMETRY_CPP_INCLUDE_DIRS} + ) + + target_link_libraries( + tracing-library + PRIVATE + ${OPENTELEMETRY_CPP_LIBRARIES}) + endif() + target_link_libraries( tracing-library PUBLIC @@ -658,7 +600,7 @@ if (NOT WIN32) target_compile_options( simple PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) else() target_compile_options( @@ -722,7 +664,7 @@ if (NOT WIN32) target_compile_options( multi_server PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) else() target_compile_options( @@ -787,7 +729,7 @@ if (NOT WIN32) target_compile_options( memory_alloc PRIVATE - /W1 /D_WIN32_WINNT=0x0A00 /EHsc + /W1 /D_WIN32_WINNT=0x0A00 /EHsc /Zc:preprocessor ) else() target_compile_options( @@ -831,6 +773,6 @@ if (NOT WIN32) endif() # NOT WIN32 # Currently unit tests do not build for windows... 
-if (NOT WIN32) +if ( NOT WIN32) add_subdirectory(test test) endif() # NOT WIN32 diff --git a/src/classification.cc b/src/classification.cc index d8dab03817..2d8cd26b9e 100644 --- a/src/classification.cc +++ b/src/classification.cc @@ -28,6 +28,7 @@ #include #include + #include "common.h" namespace triton { namespace server { diff --git a/src/classification.h b/src/classification.h index 27c8ba1ef6..9264baa2b0 100644 --- a/src/classification.h +++ b/src/classification.h @@ -27,6 +27,7 @@ #include #include + #include "triton/core/tritonserver.h" namespace triton { namespace server { diff --git a/src/command_line_parser.cc b/src/command_line_parser.cc new file mode 100644 index 0000000000..20307eae9f --- /dev/null +++ b/src/command_line_parser.cc @@ -0,0 +1,2244 @@ +// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NVIDIA CORPORATION nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +// + +#include "command_line_parser.h" +constexpr const char* GLOBAL_OPTION_GROUP = ""; + +#ifdef _WIN32 +int optind = 1; +const char* optarg = nullptr; + +/// Implementation of `getopt_long` for Windows. +/// Linux uses available implementation: +/// https://github.com/gcc-mirror/gcc/blob/fab08d12b40ad637c5a4ce8e026fb43cd3f0fad1/include/getopt.h +/// and +/// https://github.com/gcc-mirror/gcc/blob/fab08d12b40ad637c5a4ce8e026fb43cd3f0fad1/libiberty/getopt.c#L521 +/// Parameters' description is available here: +/// https://github.com/gcc-mirror/gcc/blob/fab08d12b40ad637c5a4ce8e026fb43cd3f0fad1/libiberty/getopt.c#L464-L518 +/// `optind' is an index to iterate over `argv`, (whose length is `argc`), +/// and starts from 1, since argv[0] is the program name. +/// Text in the current `argv`-element is returned in `optarg'. +/// Note: if option was provided in the form of --=, then +/// optarg is (argv[optind] + found + 1), i.e. everything after `=`. +/// Alternatively, option can be provided as -- . +/// In this case, is storred as a separate parameter in `argv`. +/// `longind` returns the index in `longopts` of the long-named option found. 
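The Windows getopt_long shim documented above accepts each long option either as --name=value, where the value is everything after the first '=', or as --name value, where the value is taken from the next argv entry, and reports an error when a required value is missing. A small Python analogue of that contract is below, purely as an illustration of the parsing rules rather than the C++ implementation itself.

    # Python analogue of the long-option contract above:
    # "--key=value" or "--key value"; a missing required value is an error.
    def parse_long_options(argv, required):
        opts, i = {}, 0
        while i < len(argv):
            arg = argv[i]
            assert arg.startswith("--"), f"unexpected argument: {arg}"
            key, eq, value = arg[2:].partition("=")
            if not eq and key in required:
                i += 1
                if i >= len(argv):
                    raise ValueError(f"option '--{key}' requires an argument")
                value = argv[i]
            opts[key] = value
            i += 1
        return opts

    print(parse_long_options(
        ["--model-repository=/models", "--log-verbose", "1"],
        required={"model-repository", "log-verbose"},
    ))
    # {'model-repository': '/models', 'log-verbose': '1'}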
+ +int +getopt_long( + int argc, char* const argv[], const char* optstring, + const struct option* longopts, int* longind) +{ + if (optind >= argc) { + return -1; + } + const struct option* curr_longopt = longopts; + std::string argv_str = argv[optind]; + size_t found = argv_str.find_first_of("="); + std::string key = argv_str.substr( + 2, (found == std::string::npos) ? std::string::npos : (found - 2)); + int option_index = 0; + for (curr_longopt, option_index; curr_longopt->name; + curr_longopt++, option_index++) { + if (key == curr_longopt->name) { + if (longind != NULL) + (*longind) = option_index; + if (curr_longopt->has_arg == required_argument) { + if (found == std::string::npos) { + optind++; + if (optind >= argc) { + std::cerr << argv[0] << ": option '" << argv_str + << "' requires an argument" << std::endl; + return '?'; + } + optarg = argv[optind]; + } else { + optarg = (argv[optind] + found + 1); + } + } + optind++; + return curr_longopt->val; + } + } + return -1; +} +#endif + +#include +#include +#include +#include + +#include "common.h" + +#define TRITONJSON_STATUSTYPE TRITONSERVER_Error* +#define TRITONJSON_STATUSRETURN(M) \ + return TRITONSERVER_ErrorNew(TRITONSERVER_ERROR_INTERNAL, (M).c_str()) +#define TRITONJSON_STATUSSUCCESS nullptr +#include "triton/common/triton_json.h" + + +namespace triton { namespace server { + +// [FIXME] expose following parse helpers for other type of parser +namespace { + +// A wrapper around std::stoi, std::stoull, std::stoll, std::stod +// to catch `invalid argument` and `out of range` exceptions +template +T StringTo(const std::string& arg); + +template <> +int +StringTo(const std::string& arg) +{ + return std::stoi(arg); +} + +template <> +uint64_t +StringTo(const std::string& arg) +{ + return std::stoull(arg); +} + +template <> +int64_t +StringTo(const std::string& arg) +{ + return std::stoll(arg); +} + +template <> +double +StringTo(const std::string& arg) +{ + return std::stod(arg); +} + +// There must be specialization for the types to be parsed into so that +// the argument is properly validated and parsed. Attempted to use input +// operator (>>) but it will consume improper argument without error +// (i.e. parse "1.4" to 'int' will return 1 but we want to report error). +template +T +ParseOption(const std::string& arg) +{ + try { + return StringTo(arg); + } + catch (const std::invalid_argument& ia) { + std::stringstream ss; + ss << "Invalid option value. Got " << arg << std::endl; + throw ParseException(ss.str()); + } + catch (const std::out_of_range& oor) { + std::stringstream ss; + ss << "Provided option value is out of bound. Got " << arg << std::endl; + throw ParseException(ss.str()); + } +} + +template <> +bool +ParseOption(const std::string& arg) +{ + // 'arg' need to comply with template declaration + std::string larg = arg; + std::transform(larg.begin(), larg.end(), larg.begin(), [](unsigned char c) { + return std::tolower(c); + }); + + if ((larg == "true") || (larg == "on") || (larg == "1")) { + return true; + } + if ((larg == "false") || (larg == "off") || (larg == "0")) { + return false; + } + + throw ParseException("invalid value for bool option: " + arg); +} + +// Condition here merely to avoid compilation error, this function will +// be defined but not used otherwise. 
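A self-contained sketch, not part of the patch, of how the ParseOption helpers above surface bad values. ParseIntOrThrow is a hypothetical stand-in (the real helpers live in an anonymous namespace and throw ParseException), but the two failure paths mirror the invalid-argument and out-of-range messages built above; the bool specialization likewise accepts only true/on/1 and false/off/0, case-insensitively, and rejects anything else.

    // Hypothetical stand-in for the integer ParseOption specialization.
    #include <iostream>
    #include <stdexcept>
    #include <string>

    int
    ParseIntOrThrow(const std::string& arg)
    {
      try {
        return std::stoi(arg);
      }
      catch (const std::invalid_argument&) {
        throw std::runtime_error("Invalid option value. Got " + arg);
      }
      catch (const std::out_of_range&) {
        throw std::runtime_error("Provided option value is out of bound. Got " + arg);
      }
    }

    int
    main()
    {
      std::cout << ParseIntOrThrow("8001") << std::endl;  // prints 8001
      for (const std::string bad : {"not-a-number", "99999999999999999999"}) {
        try {
          ParseIntOrThrow(bad);  // invalid argument, then out of range
        }
        catch (const std::runtime_error& e) {
          std::cerr << e.what() << std::endl;
        }
      }
      return 0;
    }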
+#ifdef TRITON_ENABLE_LOGGING +int +ParseIntBoolOption(std::string arg) +{ + std::transform(arg.begin(), arg.end(), arg.begin(), [](unsigned char c) { + return std::tolower(c); + }); + + if (arg == "true") { + return 1; + } + if (arg == "false") { + return 0; + } + + return ParseOption(arg); +} +#endif // TRITON_ENABLE_LOGGING + +std::string +PairsToJsonStr(std::vector> settings) +{ + triton::common::TritonJson::Value json( + triton::common::TritonJson::ValueType::OBJECT); + for (const auto& setting : settings) { + const auto& key = setting.first; + const auto& value = setting.second; + json.SetStringObject(key.c_str(), value); + } + triton::common::TritonJson::WriteBuffer buffer; + auto err = json.Write(&buffer); + if (err != nullptr) { + LOG_TRITONSERVER_ERROR(err, "failed to convert config to JSON"); + } + return buffer.Contents(); +} + +template +std::pair +ParsePairOption(const std::string& arg, const std::string& delim_str) +{ + int delim = arg.find(delim_str); + + if ((delim < 0)) { + std::stringstream ss; + ss << "Cannot parse pair option due to incorrect number of inputs." + "-- argument requires format " + << delim_str << ". " + << "Found: " << arg << std::endl; + throw ParseException(ss.str()); + } + + std::string first_string = arg.substr(0, delim); + std::string second_string = arg.substr(delim + delim_str.length()); + + // Specific conversion from key-value string to actual key-value type, + // should be extracted out of this function if we need to parse + // more pair option of different types. + return {ParseOption(first_string), ParseOption(second_string)}; +} + +// Split 'options' by 'delim_str' and place split strings into a vector +std::vector +SplitOptions(std::string options, const std::string& delim_str) +{ + std::vector res; + + int delim = options.find(delim_str); + while ((delim >= 0)) { + res.emplace_back(options.substr(0, delim)); + options = options.substr(delim + delim_str.length()); + delim = options.find(delim_str); + } + // include last element + res.emplace_back(options); + return res; +} + +} // namespace + +enum TritonOptionId { + OPTION_HELP = 1000, +#ifdef TRITON_ENABLE_LOGGING + OPTION_LOG_VERBOSE, + OPTION_LOG_INFO, + OPTION_LOG_WARNING, + OPTION_LOG_ERROR, + OPTION_LOG_FORMAT, + OPTION_LOG_FILE, +#endif // TRITON_ENABLE_LOGGING + OPTION_ID, + OPTION_MODEL_REPOSITORY, + OPTION_EXIT_ON_ERROR, + OPTION_DISABLE_AUTO_COMPLETE_CONFIG, + OPTION_STRICT_MODEL_CONFIG, + OPTION_STRICT_READINESS, +#if defined(TRITON_ENABLE_HTTP) + OPTION_ALLOW_HTTP, + OPTION_HTTP_HEADER_FORWARD_PATTERN, + OPTION_HTTP_PORT, + OPTION_REUSE_HTTP_PORT, + OPTION_HTTP_ADDRESS, + OPTION_HTTP_THREAD_COUNT, + OPTION_HTTP_RESTRICTED_API, +#endif // TRITON_ENABLE_HTTP +#if defined(TRITON_ENABLE_GRPC) + OPTION_ALLOW_GRPC, + OPTION_GRPC_PORT, + OPTION_REUSE_GRPC_PORT, + OPTION_GRPC_ADDRESS, + OPTION_GRPC_HEADER_FORWARD_PATTERN, + OPTION_GRPC_INFER_ALLOCATION_POOL_SIZE, + OPTION_GRPC_USE_SSL, + OPTION_GRPC_USE_SSL_MUTUAL, + OPTION_GRPC_SERVER_CERT, + OPTION_GRPC_SERVER_KEY, + OPTION_GRPC_ROOT_CERT, + OPTION_GRPC_RESPONSE_COMPRESSION_LEVEL, + OPTION_GRPC_ARG_KEEPALIVE_TIME_MS, + OPTION_GRPC_ARG_KEEPALIVE_TIMEOUT_MS, + OPTION_GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, + OPTION_GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA, + OPTION_GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS, + OPTION_GRPC_ARG_HTTP2_MAX_PING_STRIKES, + OPTION_GRPC_RESTRICTED_PROTOCOL, + OPTION_GRPC_ARG_MAX_CONNECTION_AGE_MS, + OPTION_GRPC_ARG_MAX_CONNECTION_AGE_GRACE_MS, +#endif // TRITON_ENABLE_GRPC +#if 
defined(TRITON_ENABLE_SAGEMAKER) + OPTION_ALLOW_SAGEMAKER, + OPTION_SAGEMAKER_PORT, + OPTION_SAGEMAKER_SAFE_PORT_RANGE, + OPTION_SAGEMAKER_THREAD_COUNT, +#endif // TRITON_ENABLE_SAGEMAKER +#if defined(TRITON_ENABLE_VERTEX_AI) + OPTION_ALLOW_VERTEX_AI, + OPTION_VERTEX_AI_PORT, + OPTION_VERTEX_AI_THREAD_COUNT, + OPTION_VERTEX_AI_DEFAULT_MODEL, +#endif // TRITON_ENABLE_VERTEX_AI +#ifdef TRITON_ENABLE_METRICS + OPTION_ALLOW_METRICS, + OPTION_ALLOW_GPU_METRICS, + OPTION_ALLOW_CPU_METRICS, + OPTION_METRICS_ADDRESS, + OPTION_METRICS_PORT, + OPTION_METRICS_INTERVAL_MS, + OPTION_METRICS_CONFIG, +#endif // TRITON_ENABLE_METRICS +#ifdef TRITON_ENABLE_TRACING + OPTION_TRACE_FILEPATH, + OPTION_TRACE_LEVEL, + OPTION_TRACE_RATE, + OPTION_TRACE_COUNT, + OPTION_TRACE_LOG_FREQUENCY, + OPTION_TRACE_CONFIG, +#endif // TRITON_ENABLE_TRACING + OPTION_MODEL_CONTROL_MODE, + OPTION_POLL_REPO_SECS, + OPTION_STARTUP_MODEL, + OPTION_RATE_LIMIT, + OPTION_RATE_LIMIT_RESOURCE, + OPTION_PINNED_MEMORY_POOL_BYTE_SIZE, + OPTION_CUDA_MEMORY_POOL_BYTE_SIZE, + OPTION_CUDA_VIRTUAL_ADDRESS_SIZE, + OPTION_RESPONSE_CACHE_BYTE_SIZE, + OPTION_CACHE_CONFIG, + OPTION_CACHE_DIR, + OPTION_MIN_SUPPORTED_COMPUTE_CAPABILITY, + OPTION_EXIT_TIMEOUT_SECS, + OPTION_BACKEND_DIR, + OPTION_REPOAGENT_DIR, + OPTION_BUFFER_MANAGER_THREAD_COUNT, + OPTION_MODEL_LOAD_THREAD_COUNT, + OPTION_BACKEND_CONFIG, + OPTION_HOST_POLICY, + OPTION_MODEL_LOAD_GPU_LIMIT, + OPTION_MODEL_NAMESPACING +}; + +void +TritonParser::SetupOptions() +{ + global_options_.push_back( + {OPTION_HELP, "help", Option::ArgNone, "Print usage"}); + + server_options_.push_back( + {OPTION_ID, "id", Option::ArgStr, "Identifier for this server."}); + server_options_.push_back( + {OPTION_EXIT_TIMEOUT_SECS, "exit-timeout-secs", Option::ArgInt, + "Timeout (in seconds) when exiting to wait for in-flight inferences to " + "finish. After the timeout expires the server exits even if inferences " + "are still in flight."}); + + model_repo_options_.push_back( + {OPTION_MODEL_REPOSITORY, "model-store", Option::ArgStr, + "Equivalent to --model-repository."}); + model_repo_options_.push_back( + {OPTION_MODEL_REPOSITORY, "model-repository", Option::ArgStr, + "Path to model repository directory. It may be specified multiple times " + "to add multiple model repositories. Note that if a model is not unique " + "across all model repositories at any time, the model will not be " + "available."}); + model_repo_options_.push_back( + {OPTION_EXIT_ON_ERROR, "exit-on-error", Option::ArgBool, + "Exit the inference server if an error occurs during initialization."}); + model_repo_options_.push_back( + {OPTION_DISABLE_AUTO_COMPLETE_CONFIG, "disable-auto-complete-config", + Option::ArgNone, + "If set, disables the triton and backends from auto completing model " + "configuration files. Model configuration files must be provided and " + "all required " + "configuration settings must be specified."}); + model_repo_options_.push_back( + {OPTION_STRICT_READINESS, "strict-readiness", Option::ArgBool, + "If true /v2/health/ready endpoint indicates ready if the server " + "is responsive and all models are available. If false " + "/v2/health/ready endpoint indicates ready if server is responsive " + "even if some/all models are unavailable."}); + model_repo_options_.push_back( + {OPTION_MODEL_CONTROL_MODE, "model-control-mode", Option::ArgStr, + "Specify the mode for model management. Options are \"none\", \"poll\" " + "and \"explicit\". The default is \"none\". 
" + "For \"none\", the server will load all models in the model " + "repository(s) at startup and will not make any changes to the load " + "models after that. For \"poll\", the server will poll the model " + "repository(s) to detect changes and will load/unload models based on " + "those changes. The poll rate is controlled by 'repository-poll-secs'. " + "For \"explicit\", model load and unload is initiated by using the " + "model control APIs, and only models specified with --load-model will " + "be loaded at startup."}); + model_repo_options_.push_back( + {OPTION_POLL_REPO_SECS, "repository-poll-secs", Option::ArgInt, + "Interval in seconds between each poll of the model repository to check " + "for changes. Valid only when --model-control-mode=poll is " + "specified."}); + model_repo_options_.push_back( + {OPTION_STARTUP_MODEL, "load-model", Option::ArgStr, + "Name of the model to be loaded on server startup. It may be specified " + "multiple times to add multiple models. To load ALL models at startup, " + "specify '*' as the model name with --load-model=* as the ONLY " + "--load-model argument, this does not imply any pattern matching. " + "Specifying --load-model=* in conjunction with another --load-model " + "argument will result in error. Note that this option will only take " + "effect if --model-control-mode=explicit is true."}); + model_repo_options_.push_back( + {OPTION_MODEL_LOAD_THREAD_COUNT, "model-load-thread-count", + Option::ArgInt, + "The number of threads used to concurrently load models in " + "model repositories. Default is 4."}); + model_repo_options_.push_back( + {OPTION_MODEL_NAMESPACING, "model-namespacing", Option::ArgBool, + "Whether model namespacing is enable or not. If true, models with the " + "same name can be served if they are in different namespace."}); + +#if defined(TRITON_ENABLE_HTTP) + http_options_.push_back( + {OPTION_ALLOW_HTTP, "allow-http", Option::ArgBool, + "Allow the server to listen for HTTP requests."}); + http_options_.push_back( + {OPTION_HTTP_ADDRESS, "http-address", Option::ArgStr, + "The address for the http server to bind to. Default is 0.0.0.0"}); + http_options_.push_back( + {OPTION_HTTP_PORT, "http-port", Option::ArgInt, + "The port for the server to listen on for HTTP " + "requests. Default is 8000."}); + http_options_.push_back( + {OPTION_REUSE_HTTP_PORT, "reuse-http-port", Option::ArgBool, + "Allow multiple servers to listen on the same HTTP port when every " + "server has this option set. If you plan to use this option as a way to " + "load balance between different Triton servers, the same model " + "repository or set of models must be used for every server."}); + http_options_.push_back( + {OPTION_HTTP_HEADER_FORWARD_PATTERN, "http-header-forward-pattern", + Option::ArgStr, + "The regular expression pattern that will be used for forwarding HTTP " + "headers as inference request parameters."}); + http_options_.push_back( + {OPTION_HTTP_THREAD_COUNT, "http-thread-count", Option::ArgInt, + "Number of threads handling HTTP requests."}); + http_options_.push_back( + {OPTION_HTTP_RESTRICTED_API, "http-restricted-api", + ":=", + "Specify restricted HTTP api setting. The format of this " + "flag is --http-restricted-api=,=. Where " + " is a comma-separated list of apis to be restricted. " + " will be additional header key to be checked when a HTTP request " + "is received, and is the value expected to be matched." 
+ " Allowed APIs: " + + Join(RESTRICTED_CATEGORY_NAMES, ", ")}); +#endif // TRITON_ENABLE_HTTP + +#if defined(TRITON_ENABLE_GRPC) + grpc_options_.push_back( + {OPTION_ALLOW_GRPC, "allow-grpc", Option::ArgBool, + "Allow the server to listen for GRPC requests."}); + grpc_options_.push_back( + {OPTION_GRPC_ADDRESS, "grpc-address", Option::ArgStr, + "The address for the grpc server to binds to. Default is 0.0.0.0"}); + grpc_options_.push_back( + {OPTION_GRPC_PORT, "grpc-port", Option::ArgInt, + "The port for the server to listen on for GRPC " + "requests. Default is 8001."}); + grpc_options_.push_back( + {OPTION_REUSE_GRPC_PORT, "reuse-grpc-port", Option::ArgBool, + "Allow multiple servers to listen on the same GRPC port when every " + "server has this option set. If you plan to use this option as a way to " + "load balance between different Triton servers, the same model " + "repository or set of models must be used for every server."}); + grpc_options_.push_back( + {OPTION_GRPC_HEADER_FORWARD_PATTERN, "grpc-header-forward-pattern", + Option::ArgStr, + "The regular expression pattern that will be used for forwarding GRPC " + "headers as inference request parameters."}); + grpc_options_.push_back( + {OPTION_GRPC_INFER_ALLOCATION_POOL_SIZE, + "grpc-infer-allocation-pool-size", Option::ArgInt, + "The maximum number of inference request/response objects that remain " + "allocated for reuse. As long as the number of in-flight requests " + "doesn't exceed this value there will be no allocation/deallocation of " + "request/response objects."}); + grpc_options_.push_back( + {OPTION_GRPC_USE_SSL, "grpc-use-ssl", Option::ArgBool, + "Use SSL authentication for GRPC requests. Default is false."}); + grpc_options_.push_back( + {OPTION_GRPC_USE_SSL_MUTUAL, "grpc-use-ssl-mutual", Option::ArgBool, + "Use mututal SSL authentication for GRPC requests. This option will " + "preempt '--grpc-use-ssl' if it is also specified. Default is false."}); + grpc_options_.push_back( + {OPTION_GRPC_SERVER_CERT, "grpc-server-cert", Option::ArgStr, + "File holding PEM-encoded server certificate. Ignored unless " + "--grpc-use-ssl is true."}); + grpc_options_.push_back( + {OPTION_GRPC_SERVER_KEY, "grpc-server-key", Option::ArgStr, + "File holding PEM-encoded server key. Ignored unless " + "--grpc-use-ssl is true."}); + grpc_options_.push_back( + {OPTION_GRPC_ROOT_CERT, "grpc-root-cert", Option::ArgStr, + "File holding PEM-encoded root certificate. Ignore unless " + "--grpc-use-ssl is false."}); + grpc_options_.push_back( + {OPTION_GRPC_RESPONSE_COMPRESSION_LEVEL, + "grpc-infer-response-compression-level", Option::ArgStr, + "The compression level to be used while returning the infer response to " + "the peer. Allowed values are none, low, medium and high. By default, " + "compression level is selected as none."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_KEEPALIVE_TIME_MS, "grpc-keepalive-time", Option::ArgInt, + "The period (in milliseconds) after which a keepalive ping is sent on " + "the transport. Default is 7200000 (2 hours)."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_KEEPALIVE_TIMEOUT_MS, "grpc-keepalive-timeout", + Option::ArgInt, + "The period (in milliseconds) the sender of the keepalive ping waits " + "for an acknowledgement. If it does not receive an acknowledgment " + "within this time, it will close the connection. 
" + "Default is 20000 (20 seconds)."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, + "grpc-keepalive-permit-without-calls", Option::ArgBool, + "Allows keepalive pings to be sent even if there are no calls in flight " + "(0 : false; 1 : true). Default is 0 (false)."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA, + "grpc-http2-max-pings-without-data", Option::ArgInt, + "The maximum number of pings that can be sent when there is no " + "data/header frame to be sent. gRPC Core will not continue sending " + "pings if we run over the limit. Setting it to 0 allows sending pings " + "without such a restriction. Default is 2."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS, + "grpc-http2-min-recv-ping-interval-without-data", Option::ArgInt, + "If there are no data/header frames being sent on the transport, this " + "channel argument on the server side controls the minimum time " + "(in milliseconds) that gRPC Core would expect between receiving " + "successive pings. If the time between successive pings is less than " + "this time, then the ping will be considered a bad ping from the peer. " + "Such a ping counts as a ‘ping strike’. Default is 300000 (5 " + "minutes)."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_HTTP2_MAX_PING_STRIKES, "grpc-http2-max-ping-strikes", + Option::ArgInt, + "Maximum number of bad pings that the server will tolerate before " + "sending an HTTP2 GOAWAY frame and closing the transport. Setting it to " + "0 allows the server to accept any number of bad pings. Default is 2."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_MAX_CONNECTION_AGE_MS, "grpc-max-connection-age", + Option::ArgInt, + "Maximum time that a channel may exist in milliseconds." + "Default is undefined."}); + grpc_options_.push_back( + {OPTION_GRPC_ARG_MAX_CONNECTION_AGE_GRACE_MS, + "grpc-max-connection-age-grace", Option::ArgInt, + "Grace period after the channel reaches its max age. " + "Default is undefined."}); + grpc_options_.push_back( + {OPTION_GRPC_RESTRICTED_PROTOCOL, "grpc-restricted-protocol", + ":=", + "Specify restricted GRPC protocol setting. The format of this " + "flag is --grpc-restricted-protocol=,=. Where " + " is a comma-separated list of protocols to be restricted. " + " will be additional header key to be checked when a GRPC request " + "is received, and is the value expected to be matched." + " Allowed protocols: " + + Join(RESTRICTED_CATEGORY_NAMES, ", ")}); +#endif // TRITON_ENABLE_GRPC + +#ifdef TRITON_ENABLE_LOGGING + logging_options_.push_back( + {OPTION_LOG_VERBOSE, "log-verbose", Option::ArgInt, + "Set verbose logging level. Zero (0) disables verbose logging and " + "values >= 1 enable verbose logging."}); + logging_options_.push_back( + {OPTION_LOG_INFO, "log-info", Option::ArgBool, + "Enable/disable info-level logging."}); + logging_options_.push_back( + {OPTION_LOG_WARNING, "log-warning", Option::ArgBool, + "Enable/disable warning-level logging."}); + logging_options_.push_back( + {OPTION_LOG_ERROR, "log-error", Option::ArgBool, + "Enable/disable error-level logging."}); + logging_options_.push_back( + {OPTION_LOG_FORMAT, "log-format", Option::ArgStr, + "Set the logging format. Options are \"default\" and \"ISO8601\". " + "The default is \"default\". For \"default\", the log severity (L) and " + "timestamp will be logged as \"LMMDD hh:mm:ss.ssssss\". 
" + "For \"ISO8601\", the log format will be \"YYYY-MM-DDThh:mm:ssZ L\"."}); + logging_options_.push_back( + {OPTION_LOG_FILE, "log-file", Option::ArgStr, + "Set the name of the log output file. If specified, log outputs will be " + "saved to this file. If not specified, log outputs will stream to the " + "console."}); +#endif // TRITON_ENABLE_LOGGING + +#if defined(TRITON_ENABLE_SAGEMAKER) + sagemaker_options_.push_back( + {OPTION_ALLOW_SAGEMAKER, "allow-sagemaker", Option::ArgBool, + "Allow the server to listen for Sagemaker requests. Default is false."}); + sagemaker_options_.push_back( + {OPTION_SAGEMAKER_PORT, "sagemaker-port", Option::ArgInt, + "The port for the server to listen on for Sagemaker requests. Default " + "is 8080."}); + sagemaker_options_.push_back( + {OPTION_SAGEMAKER_SAFE_PORT_RANGE, "sagemaker-safe-port-range", + "-", + "Set the allowed port range for endpoints other than the SageMaker " + "endpoints."}); + sagemaker_options_.push_back( + {OPTION_SAGEMAKER_THREAD_COUNT, "sagemaker-thread-count", Option::ArgInt, + "Number of threads handling Sagemaker requests. Default is 8."}); +#endif // TRITON_ENABLE_SAGEMAKER + +#if defined(TRITON_ENABLE_VERTEX_AI) + vertex_options_.push_back( + {OPTION_ALLOW_VERTEX_AI, "allow-vertex-ai", Option::ArgBool, + "Allow the server to listen for Vertex AI requests. Default is true if " + "AIP_MODE=PREDICTION, false otherwise."}); + vertex_options_.push_back( + {OPTION_VERTEX_AI_PORT, "vertex-ai-port", Option::ArgInt, + "The port for the server to listen on for Vertex AI requests. Default " + "is AIP_HTTP_PORT if set, 8080 otherwise."}); + vertex_options_.push_back( + {OPTION_VERTEX_AI_THREAD_COUNT, "vertex-ai-thread-count", Option::ArgInt, + "Number of threads handling Vertex AI requests. Default is 8."}); + vertex_options_.push_back( + {OPTION_VERTEX_AI_DEFAULT_MODEL, "vertex-ai-default-model", + Option::ArgStr, + "The name of the model to use for single-model inference requests."}); +#endif // TRITON_ENABLE_VERTEX_AI + +#if defined(TRITON_ENABLE_METRICS) + metric_options_.push_back( + {OPTION_ALLOW_METRICS, "allow-metrics", Option::ArgBool, + "Allow the server to provide prometheus metrics."}); + metric_options_.push_back( + {OPTION_ALLOW_GPU_METRICS, "allow-gpu-metrics", Option::ArgBool, + "Allow the server to provide GPU metrics. Ignored unless " + "--allow-metrics is true."}); + metric_options_.push_back( + {OPTION_ALLOW_CPU_METRICS, "allow-cpu-metrics", Option::ArgBool, + "Allow the server to provide CPU metrics. Ignored unless " + "--allow-metrics is true."}); + metric_options_.push_back( + {OPTION_METRICS_ADDRESS, "metrics-address", Option::ArgStr, + "The address for the metrics server to bind to. Default is the same as " + "--http-address if built with HTTP support. Otherwise, default is " + "0.0.0.0"}); + metric_options_.push_back( + {OPTION_METRICS_PORT, "metrics-port", Option::ArgInt, + "The port reporting prometheus metrics. Default is 8002."}); + metric_options_.push_back( + {OPTION_METRICS_INTERVAL_MS, "metrics-interval-ms", Option::ArgFloat, + "Metrics will be collected once every " + "milliseconds. Default is 2000 milliseconds."}); + metric_options_.push_back( + {OPTION_METRICS_CONFIG, "metrics-config", "=", + "Specify a metrics-specific configuration setting. The format of this " + "flag is --metrics-config==. 
It can be specified " + "multiple times."}); +#endif // TRITON_ENABLE_METRICS + +#ifdef TRITON_ENABLE_TRACING + tracing_options_.push_back( + {OPTION_TRACE_CONFIG, "trace-config", ",=", + "Specify global or trace mode specific configuration setting. " + "The format of this flag is --trace-config " + ",=. " + "Where is either \"triton\" or \"opentelemetry\". " + "The default is \"triton\". To specify global trace settings " + "(level, rate, count, or mode), the format would be " + "--trace-config =. For \"triton\" mode, the server will " + "use " + "Triton's Trace APIs. For \"opentelemetry\" mode, the server will use " + "OpenTelemetry's APIs to generate, collect and export traces for " + "individual inference requests."}); +#endif // TRITON_ENABLE_TRACING + + cache_options_.push_back( + {OPTION_CACHE_CONFIG, "cache-config", ",=", + "Specify a cache-specific configuration setting. The format of this " + "flag is --cache-config=,=. Where " + " is the name of the cache, such as 'local' or 'redis'. " + "Example: --cache-config=local,size=1048576 will configure a 'local' " + "cache implementation with a fixed buffer pool of size 1048576 bytes."}); + cache_options_.push_back( + {OPTION_CACHE_DIR, "cache-directory", Option::ArgStr, + "The global directory searched for cache shared libraries. Default is " + "'/opt/tritonserver/caches'. This directory is expected to contain a " + "cache implementation as a shared library with the name " + "'libtritoncache.so'."}); + + + rate_limiter_options_.push_back( + // FIXME: fix the default to execution_count once RL logic is complete. + {OPTION_RATE_LIMIT, "rate-limit", Option::ArgStr, + "Specify the mode for rate limiting. Options are \"execution_count\" " + "and \"off\". The default is \"off\". For " + "\"execution_count\", the server will determine the instance using " + "configured priority and the number of time the instance has been " + "used to run inference. The inference will finally be executed once " + "the required resources are available. For \"off\", the server will " + "ignore any rate limiter config and run inference as soon as an " + "instance is ready."}); + rate_limiter_options_.push_back( + {OPTION_RATE_LIMIT_RESOURCE, "rate-limit-resource", + "::", + "The number of resources available to the server. The format of this " + "flag is --rate-limit-resource=::. The " + " is optional and if not listed will be applied to every " + "device. If the resource is specified as \"GLOBAL\" in the model " + "configuration the resource is considered shared among all the devices " + "in the system. The property is ignored for such resources. " + "This flag can be specified multiple times to specify each resources " + "and their availability. By default, the max across all instances that " + "list the resource is selected as its availability. The values for this " + "flag is case-insensitive."}); + + memory_device_options_.push_back( + {OPTION_PINNED_MEMORY_POOL_BYTE_SIZE, "pinned-memory-pool-byte-size", + Option::ArgInt, + "The total byte size that can be allocated as pinned system memory. " + "If GPU support is enabled, the server will allocate pinned system " + "memory to accelerate data transfer between host and devices until it " + "exceeds the specified byte size. If 'numa-node' is configured via " + "--host-policy, the pinned system memory of the pool size will be " + "allocated on each numa node. This option will not affect the " + "allocation conducted by the backend frameworks. 
Default is 256 MB."}); + memory_device_options_.push_back( + {OPTION_CUDA_MEMORY_POOL_BYTE_SIZE, "cuda-memory-pool-byte-size", + ":", + "The total byte size that can be allocated as CUDA memory for the GPU " + "device. If GPU support is enabled, the server will allocate CUDA " + "memory to minimize data transfer between host and devices until it " + "exceeds the specified byte size. This option will not affect the " + "allocation conducted by the backend frameworks. The argument should be " + "2 integers separated by colons in the format " + ":. This option can be used multiple " + "times, but only once per GPU device. Subsequent uses will overwrite " + "previous uses for the same GPU device. Default is 64 MB."}); + memory_device_options_.push_back( + {OPTION_CUDA_VIRTUAL_ADDRESS_SIZE, "cuda-virtual-address-size", + ":", + "The total CUDA virtual address size that will be used for each " + "implicit state when growable memory is used. This value determines " + "the maximum size of each implicit state. The state size cannot go " + "beyond this value. The argument should be " + "2 integers separated by colons in the format " + ":. This option can be used " + "multiple " + "times, but only once per GPU device. Subsequent uses will overwrite " + "previous uses for the same GPU device. Default is 1 GB."}); + memory_device_options_.push_back( + {OPTION_MIN_SUPPORTED_COMPUTE_CAPABILITY, + "min-supported-compute-capability", Option::ArgFloat, + "The minimum supported CUDA compute capability. GPUs that don't support " + "this compute capability will not be used by the server."}); + memory_device_options_.push_back( + {OPTION_BUFFER_MANAGER_THREAD_COUNT, "buffer-manager-thread-count", + Option::ArgInt, + "The number of threads used to accelerate copies and other operations " + "required to manage input and output tensor contents. Default is 0."}); + memory_device_options_.push_back( + {OPTION_HOST_POLICY, "host-policy", ",=", + "Specify a host policy setting associated with a policy name. The " + "format of this flag is --host-policy=,=. " + "Currently supported settings are 'numa-node', 'cpu-cores'. Note that " + "'numa-node' setting will affect pinned memory pool behavior, see " + "--pinned-memory-pool for more detail."}); + memory_device_options_.push_back( + {OPTION_MODEL_LOAD_GPU_LIMIT, "model-load-gpu-limit", + ":", + "Specify the limit on GPU memory usage as a fraction. If model loading " + "on the device is requested and the current memory usage exceeds the " + "limit, the load will be rejected. If not specified, the limit will " + "not be set."}); + + backend_options_.push_back( + {OPTION_BACKEND_DIR, "backend-directory", Option::ArgStr, + "The global directory searched for backend shared libraries. Default is " + "'/opt/tritonserver/backends'."}); + backend_options_.push_back( + {OPTION_BACKEND_CONFIG, "backend-config", ",=", + "Specify a backend-specific configuration setting. The format of this " + "flag is --backend-config=,=. Where " + " is the name of the backend, such as 'tensorrt'."}); + + repo_agent_options_.push_back( + {OPTION_REPOAGENT_DIR, "repoagent-directory", Option::ArgStr, + "The global directory searched for repository agent shared libraries. " + "Default is '/opt/tritonserver/repoagents'."}); + + // Deprecations + deprecated_options_.push_back( + {OPTION_STRICT_MODEL_CONFIG, "strict-model-config", Option::ArgBool, + "DEPRECATED: If true model configuration files must be provided and all " + "required " + "configuration settings must be specified. 
If false the model " + "configuration may be absent or only partially specified and the " + "server will attempt to derive the missing required configuration."}); + deprecated_options_.push_back( + {OPTION_RESPONSE_CACHE_BYTE_SIZE, "response-cache-byte-size", + Option::ArgInt, "DEPRECATED: Please use --cache-config instead."}); +#ifdef TRITON_ENABLE_TRACING + deprecated_options_.push_back( + {OPTION_TRACE_FILEPATH, "trace-file", Option::ArgStr, + "DEPRECATED: Please use --trace-config triton,file=" + " Set the file where trace output will be saved. If " + "--trace-log-frequency" + " is also specified, this argument value will be the prefix of the files" + " to save the trace output. See --trace-log-frequency for detail."}); + deprecated_options_.push_back( + {OPTION_TRACE_LEVEL, "trace-level", Option::ArgStr, + "DEPRECATED: Please use --trace-config level=" + "Specify a trace level. OFF to disable tracing, TIMESTAMPS to " + "trace timestamps, TENSORS to trace tensors. It may be specified " + "multiple times to trace multiple information. Default is OFF."}); + deprecated_options_.push_back( + {OPTION_TRACE_RATE, "trace-rate", Option::ArgInt, + "DEPRECATED: Please use --trace-config rate=" + "Set the trace sampling rate. Default is 1000."}); + deprecated_options_.push_back( + {OPTION_TRACE_COUNT, "trace-count", Option::ArgInt, + "DEPRECATED: Please use --trace-config count=" + "Set the number of traces to be sampled. If the value is -1, the number " + "of traces to be sampled will not be limited. Default is -1."}); + deprecated_options_.push_back( + {OPTION_TRACE_LOG_FREQUENCY, "trace-log-frequency", Option::ArgInt, + "DEPRECATED: Please use --trace-config triton,log-frequency=" + "Set the trace log frequency. If the value is 0, Triton will only log " + "the trace output to when shutting down. Otherwise, Triton " + "will log the trace output to . when it collects the " + "specified number of traces. For example, if the log frequency is 100, " + "when Triton collects the 100-th trace, it logs the traces to file " + ".0, and when it collects the 200-th trace, it logs the " + "101-th to the 200-th traces to file .1. 
Default is 0."}); +#endif // TRITON_ENABLE_TRACING +} + +void +TritonParser::SetupOptionGroups() +{ + SetupOptions(); + option_groups_.emplace_back(GLOBAL_OPTION_GROUP, global_options_); + option_groups_.emplace_back("Server", server_options_); + option_groups_.emplace_back("Logging", logging_options_); + option_groups_.emplace_back("Model Repository", model_repo_options_); + option_groups_.emplace_back("HTTP", http_options_); + option_groups_.emplace_back("GRPC", grpc_options_); + option_groups_.emplace_back("Sagemaker", sagemaker_options_); + option_groups_.emplace_back("Vertex", vertex_options_); + option_groups_.emplace_back("Metrics", metric_options_); + option_groups_.emplace_back("Tracing", tracing_options_); + option_groups_.emplace_back("Backend", backend_options_); + option_groups_.emplace_back("Repository Agent", repo_agent_options_); + option_groups_.emplace_back("Response Cache", cache_options_); + option_groups_.emplace_back("Rate Limiter", rate_limiter_options_); + option_groups_.emplace_back( + "Memory/Device Management", memory_device_options_); + option_groups_.emplace_back("DEPRECATED", deprecated_options_); +} + +TritonParser::TritonParser() +{ + SetupOptionGroups(); +} + +void +TritonServerParameters::CheckPortCollision() +{ + // [FIXME] try to make this function endpoint type agnostic + // List of enabled services and their constraints + std::vector< + std::tuple> + ports; +#ifdef TRITON_ENABLE_HTTP + if (allow_http_) { + ports.emplace_back("HTTP", http_address_, http_port_, false, -1, -1); + } +#endif // TRITON_ENABLE_HTTP +#ifdef TRITON_ENABLE_GRPC + if (allow_grpc_) { + ports.emplace_back( + "GRPC", grpc_options_.socket_.address_, grpc_options_.socket_.port_, + false, -1, -1); + } +#endif // TRITON_ENABLE_GRPC +#ifdef TRITON_ENABLE_METRICS + if (allow_metrics_) { + ports.emplace_back( + "metrics", metrics_address_, metrics_port_, false, -1, -1); + } +#endif // TRITON_ENABLE_METRICS +#ifdef TRITON_ENABLE_SAGEMAKER + if (allow_sagemaker_) { + ports.emplace_back( + "SageMaker", sagemaker_address_, sagemaker_port_, + sagemaker_safe_range_set_, sagemaker_safe_range_.first, + sagemaker_safe_range_.second); + } +#endif // TRITON_ENABLE_SAGEMAKER +#ifdef TRITON_ENABLE_VERTEX_AI + if (allow_vertex_ai_) { + ports.emplace_back( + "Vertex AI", vertex_ai_address_, vertex_ai_port_, false, -1, -1); + } +#endif // TRITON_ENABLE_VERTEX_AI + + for (auto curr_it = ports.begin(); curr_it != ports.end(); ++curr_it) { + // If the current service doesn't specify the allow port range for other + // services, then we don't need to revisit the checked services + auto comparing_it = (std::get<3>(*curr_it)) ? 
ports.begin() : (curr_it + 1); + for (; comparing_it != ports.end(); ++comparing_it) { + if (comparing_it == curr_it) { + continue; + } + if (std::get<1>(*curr_it) != std::get<1>(*comparing_it)) { + continue; + } + // Set range and comparing service port is out of range + if (std::get<3>(*curr_it) && + ((std::get<2>(*comparing_it) < std::get<4>(*curr_it)) || + (std::get<2>(*comparing_it) > std::get<5>(*curr_it)))) { + std::stringstream ss; + ss << "The server cannot listen to " << std::get<0>(*comparing_it) + << " requests at port " << std::get<2>(*comparing_it) + << ", allowed port range is [" << std::get<4>(*curr_it) << ", " + << std::get<5>(*curr_it) << "]" << std::endl; + throw ParseException(ss.str()); + } + if (std::get<2>(*curr_it) == std::get<2>(*comparing_it)) { + std::stringstream ss; + ss << "The server cannot listen to " << std::get<0>(*curr_it) + << " requests " + << "and " << std::get<0>(*comparing_it) + << " requests at the same address and port " << std::get<1>(*curr_it) + << ":" << std::get<2>(*curr_it) << std::endl; + throw ParseException(ss.str()); + } + } + } +} + +TritonServerParameters::ManagedTritonServerOptionPtr +TritonServerParameters::BuildTritonServerOptions() +{ + TRITONSERVER_ServerOptions* loptions = nullptr; + THROW_IF_ERR( + ParseException, TRITONSERVER_ServerOptionsNew(&loptions), + "creating server options"); + ManagedTritonServerOptionPtr managed_ptr( + loptions, TRITONSERVER_ServerOptionsDelete); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetServerId(loptions, server_id_.c_str()), + "setting server ID"); + for (const auto& model_repository_path : model_repository_paths_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelRepositoryPath( + loptions, model_repository_path.c_str()), + "setting model repository path"); + } + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelControlMode(loptions, control_mode_), + "setting model control mode"); + for (const auto& model : startup_models_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetStartupModel(loptions, model.c_str()), + "setting startup model"); + } + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetRateLimiterMode(loptions, rate_limit_mode_), + "setting rate limiter configuration"); + for (const auto& resource : rate_limit_resources_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsAddRateLimiterResource( + loptions, std::get<0>(resource).c_str(), std::get<1>(resource), + std::get<2>(resource)), + "setting rate limiter resource"); + } + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetPinnedMemoryPoolByteSize( + loptions, pinned_memory_pool_byte_size_), + "setting total pinned memory byte size"); + for (const auto& cuda_pool : cuda_pools_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCudaMemoryPoolByteSize( + loptions, cuda_pool.first, cuda_pool.second), + "setting total CUDA memory byte size"); + } + for (const auto& cuda_virtual_address_size : cuda_virtual_address_size_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCudaVirtualAddressSize( + loptions, cuda_virtual_address_size.first, + cuda_virtual_address_size.second), + "setting total CUDA virtual address size"); + } + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetMinSupportedComputeCapability( + loptions, min_supported_compute_capability_), + "setting minimum supported CUDA compute capability"); + THROW_IF_ERR( + ParseException, + 
TRITONSERVER_ServerOptionsSetExitOnError(loptions, exit_on_error_), + "setting exit on error"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetStrictModelConfig( + loptions, strict_model_config_), + "setting strict model configuration"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetStrictReadiness(loptions, strict_readiness_), + "setting strict readiness"); + // [FIXME] std::max seems to be part of Parse() + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetExitTimeout( + loptions, std::max(0, exit_timeout_secs_)), + "setting exit timeout"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetBufferManagerThreadCount( + loptions, std::max(0, buffer_manager_thread_count_)), + "setting buffer manager thread count"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelLoadThreadCount( + loptions, std::max(1u, model_load_thread_count_)), + "setting model load thread count"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelNamespacing( + loptions, enable_model_namespacing_), + "setting model namespacing"); + +#ifdef TRITON_ENABLE_LOGGING + TRITONSERVER_ServerOptionsSetLogFile(loptions, log_file_.c_str()); + THROW_IF_ERR( + ParseException, TRITONSERVER_ServerOptionsSetLogInfo(loptions, log_info_), + "setting log info enable"); + THROW_IF_ERR( + ParseException, TRITONSERVER_ServerOptionsSetLogWarn(loptions, log_warn_), + "setting log warn enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetLogError(loptions, log_error_), + "setting log error enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetLogVerbose(loptions, log_verbose_), + "setting log verbose level"); + switch (log_format_) { + case triton::common::Logger::Format::kDEFAULT: + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetLogFormat( + loptions, TRITONSERVER_LOG_DEFAULT), + "setting log format"); + break; + case triton::common::Logger::Format::kISO8601: + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetLogFormat( + loptions, TRITONSERVER_LOG_ISO8601), + "setting log format"); + break; + } +#endif // TRITON_ENABLE_LOGGING + +#ifdef TRITON_ENABLE_METRICS + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetMetrics(loptions, allow_metrics_), + "setting metrics enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetGpuMetrics(loptions, allow_gpu_metrics_), + "setting GPU metrics enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCpuMetrics(loptions, allow_cpu_metrics_), + "setting CPU metrics enable"); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetMetricsInterval( + loptions, metrics_interval_ms_), + "setting metrics interval"); + for (const auto& mcs : metrics_config_settings_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetMetricsConfig( + loptions, std::get<0>(mcs).c_str(), std::get<1>(mcs).c_str(), + std::get<2>(mcs).c_str()), + "setting metrics configuration"); + } + +#endif // TRITON_ENABLE_METRICS + + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetBackendDirectory( + loptions, backend_dir_.c_str()), + "setting backend directory"); + + // Enable cache and configure it if a cache CLI arg is passed, + // this will allow for an empty configuration. 
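The cache block that follows hands TRITONSERVER_ServerOptionsSetCacheConfig a JSON string built by PairsToJsonStr. A minimal stand-in, not part of the patch, showing the shape of that string for the documented example --cache-config=local,size=1048576; SettingsToJson is hypothetical, and the exact formatting is up to the triton_json helper used above.

    // Illustrative only: builds the JSON object that represents one cache's
    // parsed key/value settings, e.g. {"size":"1048576"} for cache 'local'.
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    std::string
    SettingsToJson(const std::vector<std::pair<std::string, std::string>>& settings)
    {
      std::string json = "{";
      for (size_t i = 0; i < settings.size(); ++i) {
        if (i != 0) {
          json += ",";
        }
        json += "\"" + settings[i].first + "\":\"" + settings[i].second + "\"";
      }
      return json + "}";
    }

    int
    main()
    {
      // Parsed from: --cache-config=local,size=1048576
      std::vector<std::pair<std::string, std::string>> local_settings = {
          {"size", "1048576"}};
      std::cout << "cache 'local' config: " << SettingsToJson(local_settings)
                << std::endl;
      return 0;
    }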
+ if (enable_cache_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCacheDirectory( + loptions, cache_dir_.c_str()), + "setting cache directory"); + + for (const auto& cache_pair : cache_config_settings_) { + const auto& cache_name = cache_pair.first; + const auto& settings = cache_pair.second; + const auto& json_config_str = PairsToJsonStr(settings); + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetCacheConfig( + loptions, cache_name.c_str(), json_config_str.c_str()), + "setting cache configuration"); + } + } + + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetRepoAgentDirectory( + loptions, repoagent_dir_.c_str()), + "setting repository agent directory"); + for (const auto& bcs : backend_config_settings_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetBackendConfig( + loptions, std::get<0>(bcs).c_str(), std::get<1>(bcs).c_str(), + std::get<2>(bcs).c_str()), + "setting backend configuration"); + } + for (const auto& limit : load_gpu_limit_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetModelLoadDeviceLimit( + loptions, TRITONSERVER_INSTANCEGROUPKIND_GPU, limit.first, + limit.second), + "setting model load GPU limit"); + } + for (const auto& hp : host_policies_) { + THROW_IF_ERR( + ParseException, + TRITONSERVER_ServerOptionsSetHostPolicy( + loptions, std::get<0>(hp).c_str(), std::get<1>(hp).c_str(), + std::get<2>(hp).c_str()), + "setting host policy"); + } + return managed_ptr; +} + +std::pair> +TritonParser::Parse(int argc, char** argv) +{ + // + // Step 1. Before parsing setup + // + TritonServerParameters lparams; + bool strict_model_config_present{false}; + bool disable_auto_complete_config{false}; + bool cache_size_present{false}; + bool cache_config_present{false}; +#ifdef TRITON_ENABLE_TRACING + bool explicit_disable_trace{false}; + bool trace_filepath_present{false}; + bool trace_level_present{false}; + bool trace_rate_present{false}; + bool trace_count_present{false}; + bool trace_log_frequency_present{false}; +#endif // TRITON_ENABLE_TRACING + int option_index = 0; + +#ifdef TRITON_ENABLE_GRPC + triton::server::grpc::Options& lgrpc_options = lparams.grpc_options_; +#endif // TRITON_ENABLE_GRPC + +#ifdef TRITON_ENABLE_VERTEX_AI + // Set different default value if specific flag is set + { + auto aip_mode = + triton::server::GetEnvironmentVariableOrDefault("AIP_MODE", ""); + // Enable Vertex AI service and disable HTTP / GRPC service by default + // if detecting Vertex AI environment + if (aip_mode == "PREDICTION") { + lparams.allow_vertex_ai_ = true; +#ifdef TRITON_ENABLE_HTTP + lparams.allow_http_ = false; +#endif // TRITON_ENABLE_HTTP +#ifdef TRITON_ENABLE_GRPC + lparams.allow_grpc_ = false; +#endif // TRITON_ENABLE_GRPC + } + auto port = triton::server::GetEnvironmentVariableOrDefault( + "AIP_HTTP_PORT", "8080"); + lparams.vertex_ai_port_ = ParseOption(port); + } +#endif // TRITON_ENABLE_VERTEX_AI + + // + // Step 2. parse options + // + std::vector long_options; + for (const auto& group : option_groups_) { + for (const auto& o : group.second) { + long_options.push_back(o.GetLongOption()); + } + } + long_options.push_back({nullptr, 0, nullptr, 0}); + + int flag; + while ((flag = getopt_long( + argc, argv, "", &long_options[0], &option_index)) != -1) { + try { + switch (flag) { + case OPTION_HELP: + // [FIXME] how help is printed? 
+ case '?': + // [FIXME] fall through when seeing this, currently consumes all + // options [FIXME] disable stderr output of `getopt_long` + throw ParseException(); +#ifdef TRITON_ENABLE_LOGGING + case OPTION_LOG_VERBOSE: + lparams.log_verbose_ = ParseIntBoolOption(optarg); + break; + case OPTION_LOG_INFO: + lparams.log_info_ = ParseOption(optarg); + break; + case OPTION_LOG_WARNING: + lparams.log_warn_ = ParseOption(optarg); + break; + case OPTION_LOG_ERROR: + lparams.log_error_ = ParseOption(optarg); + break; + case OPTION_LOG_FORMAT: { + std::string format_str(optarg); + if (format_str == "default") { + lparams.log_format_ = triton::common::Logger::Format::kDEFAULT; + } else if (format_str == "ISO8601") { + lparams.log_format_ = triton::common::Logger::Format::kISO8601; + } else { + throw ParseException("invalid argument for --log-format"); + } + break; + } + case OPTION_LOG_FILE: + lparams.log_file_ = optarg; + break; +#endif // TRITON_ENABLE_LOGGING + + case OPTION_ID: + lparams.server_id_ = optarg; + break; + case OPTION_MODEL_REPOSITORY: + lparams.model_repository_paths_.insert(optarg); + break; + case OPTION_EXIT_ON_ERROR: + lparams.exit_on_error_ = ParseOption(optarg); + break; + case OPTION_DISABLE_AUTO_COMPLETE_CONFIG: + disable_auto_complete_config = true; + break; + case OPTION_STRICT_MODEL_CONFIG: + std::cerr << "Warning: '--strict-model-config' has been deprecated! " + "Please use '--disable-auto-complete-config' instead." + << std::endl; + strict_model_config_present = true; + lparams.strict_model_config_ = ParseOption(optarg); + break; + case OPTION_STRICT_READINESS: + lparams.strict_readiness_ = ParseOption(optarg); + break; + +#ifdef TRITON_ENABLE_HTTP + case OPTION_ALLOW_HTTP: + lparams.allow_http_ = ParseOption(optarg); + break; + case OPTION_HTTP_PORT: + lparams.http_port_ = ParseOption(optarg); + break; + case OPTION_REUSE_HTTP_PORT: + lparams.reuse_http_port_ = ParseOption(optarg); + break; + case OPTION_HTTP_ADDRESS: + lparams.http_address_ = optarg; + break; + case OPTION_HTTP_HEADER_FORWARD_PATTERN: + lparams.http_forward_header_pattern_ = optarg; + break; + break; + case OPTION_HTTP_THREAD_COUNT: + lparams.http_thread_cnt_ = ParseOption(optarg); + break; + case OPTION_HTTP_RESTRICTED_API: + ParseRestrictedFeatureOption( + optarg, long_options[option_index].name, "", "api", + lparams.http_restricted_apis_); + break; + +#endif // TRITON_ENABLE_HTTP + +#ifdef TRITON_ENABLE_SAGEMAKER + case OPTION_ALLOW_SAGEMAKER: + lparams.allow_sagemaker_ = ParseOption(optarg); + break; + case OPTION_SAGEMAKER_PORT: + lparams.sagemaker_port_ = ParseOption(optarg); + break; + case OPTION_SAGEMAKER_SAFE_PORT_RANGE: + lparams.sagemaker_safe_range_set_ = true; + lparams.sagemaker_safe_range_ = + ParsePairOption(optarg, "-"); + break; + case OPTION_SAGEMAKER_THREAD_COUNT: + lparams.sagemaker_thread_cnt_ = ParseOption(optarg); + break; +#endif // TRITON_ENABLE_SAGEMAKER + +#ifdef TRITON_ENABLE_VERTEX_AI + case OPTION_ALLOW_VERTEX_AI: + lparams.allow_vertex_ai_ = ParseOption(optarg); + break; + case OPTION_VERTEX_AI_PORT: + lparams.vertex_ai_port_ = ParseOption(optarg); + break; + case OPTION_VERTEX_AI_THREAD_COUNT: + lparams.vertex_ai_thread_cnt_ = ParseOption(optarg); + break; + case OPTION_VERTEX_AI_DEFAULT_MODEL: + lparams.vertex_ai_default_model_ = optarg; + break; +#endif // TRITON_ENABLE_VERTEX_AI + +#ifdef TRITON_ENABLE_GRPC + case OPTION_ALLOW_GRPC: + lparams.allow_grpc_ = ParseOption(optarg); + break; + case OPTION_GRPC_PORT: + lgrpc_options.socket_.port_ = ParseOption(optarg); 
+ break; + case OPTION_REUSE_GRPC_PORT: + lgrpc_options.socket_.reuse_port_ = ParseOption(optarg); + break; + case OPTION_GRPC_ADDRESS: + lgrpc_options.socket_.address_ = optarg; + break; + case OPTION_GRPC_INFER_ALLOCATION_POOL_SIZE: + lgrpc_options.infer_allocation_pool_size_ = ParseOption(optarg); + break; + case OPTION_GRPC_USE_SSL: + lgrpc_options.ssl_.use_ssl_ = ParseOption(optarg); + break; + case OPTION_GRPC_USE_SSL_MUTUAL: + lgrpc_options.ssl_.use_mutual_auth_ = ParseOption(optarg); + lgrpc_options.ssl_.use_ssl_ = true; + break; + case OPTION_GRPC_SERVER_CERT: + lgrpc_options.ssl_.server_cert_ = optarg; + break; + case OPTION_GRPC_SERVER_KEY: + lgrpc_options.ssl_.server_key_ = optarg; + break; + case OPTION_GRPC_ROOT_CERT: + lgrpc_options.ssl_.root_cert_ = optarg; + break; + case OPTION_GRPC_RESPONSE_COMPRESSION_LEVEL: { + std::string mode_str(optarg); + std::transform( + mode_str.begin(), mode_str.end(), mode_str.begin(), ::tolower); + if (mode_str == "none") { + lgrpc_options.infer_compression_level_ = GRPC_COMPRESS_LEVEL_NONE; + } else if (mode_str == "low") { + lgrpc_options.infer_compression_level_ = GRPC_COMPRESS_LEVEL_LOW; + } else if (mode_str == "medium") { + lgrpc_options.infer_compression_level_ = GRPC_COMPRESS_LEVEL_MED; + } else if (mode_str == "high") { + lgrpc_options.infer_compression_level_ = GRPC_COMPRESS_LEVEL_HIGH; + } else { + throw ParseException( + "invalid argument for " + "--grpc_infer_response_compression_level"); + } + break; + } + case OPTION_GRPC_ARG_KEEPALIVE_TIME_MS: + lgrpc_options.keep_alive_.keepalive_time_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_KEEPALIVE_TIMEOUT_MS: + lgrpc_options.keep_alive_.keepalive_timeout_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS: + lgrpc_options.keep_alive_.keepalive_permit_without_calls_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA: + lgrpc_options.keep_alive_.http2_max_pings_without_data_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS: + lgrpc_options.keep_alive_ + .http2_min_recv_ping_interval_without_data_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_HTTP2_MAX_PING_STRIKES: + lgrpc_options.keep_alive_.http2_max_ping_strikes_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_MAX_CONNECTION_AGE_MS: + lgrpc_options.keep_alive_.max_connection_age_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_ARG_MAX_CONNECTION_AGE_GRACE_MS: + lgrpc_options.keep_alive_.max_connection_age_grace_ms_ = + ParseOption(optarg); + break; + case OPTION_GRPC_RESTRICTED_PROTOCOL: { + ParseRestrictedFeatureOption( + optarg, long_options[option_index].name, + std::string( + triton::server::grpc::kRestrictedProtocolHeaderTemplate), + "protocol", lgrpc_options.restricted_protocols_); + break; + } + case OPTION_GRPC_HEADER_FORWARD_PATTERN: + lgrpc_options.forward_header_pattern_ = optarg; + break; +#endif // TRITON_ENABLE_GRPC + +#ifdef TRITON_ENABLE_METRICS + case OPTION_ALLOW_METRICS: + lparams.allow_metrics_ = ParseOption(optarg); + break; + case OPTION_ALLOW_GPU_METRICS: + lparams.allow_gpu_metrics_ = ParseOption(optarg); + break; + case OPTION_ALLOW_CPU_METRICS: + lparams.allow_cpu_metrics_ = ParseOption(optarg); + break; + case OPTION_METRICS_ADDRESS: + lparams.metrics_address_ = optarg; + break; + case OPTION_METRICS_PORT: + lparams.metrics_port_ = ParseOption(optarg); + break; + case OPTION_METRICS_INTERVAL_MS: + lparams.metrics_interval_ms_ = 
ParseOption(optarg); + break; + case OPTION_METRICS_CONFIG: + lparams.metrics_config_settings_.push_back( + ParseMetricsConfigOption(optarg)); + break; +#endif // TRITON_ENABLE_METRICS + +#ifdef TRITON_ENABLE_TRACING + case OPTION_TRACE_FILEPATH: { + std::cerr << "Warning: '--trace-file' has been deprecated and will be" + " removed in future releases. Please use " + "'--trace-config triton,file= instead." + << std::endl; + trace_filepath_present = true; + lparams.trace_filepath_ = optarg; + break; + } + case OPTION_TRACE_LEVEL: { + std::cerr + << "Warning: '--trace-level' has been deprecated and will be" + " removed in future releases. Please use " + "'--trace-config level= instead." + << std::endl; + trace_level_present = true; + auto parsed_level = ParseTraceLevelOption(optarg); + explicit_disable_trace |= + (parsed_level == TRITONSERVER_TRACE_LEVEL_DISABLED); + lparams.trace_level_ = static_cast( + lparams.trace_level_ | parsed_level); + break; + } + case OPTION_TRACE_RATE: + std::cerr << "Warning: '--trace-rate' has been deprecated and will be" + " removed in future releases. Please use " + "'--trace-config rate= instead." + << std::endl; + trace_rate_present = true; + lparams.trace_rate_ = ParseOption(optarg); + break; + + case OPTION_TRACE_COUNT: + std::cerr + << "Warning: '--trace-count' has been deprecated and will be" + " removed in future releases. Please use " + "'--trace-config count= instead." + << std::endl; + trace_count_present = true; + lparams.trace_count_ = ParseOption(optarg); + break; + case OPTION_TRACE_LOG_FREQUENCY: + std::cerr + << "Warning: '--trace-log-frequency' has been deprecated and " + "will be" + " removed in future releases. Please use " + "'--trace-config triton,log-frequency= instead." + << std::endl; + trace_log_frequency_present = true; + lparams.trace_log_frequency_ = ParseOption(optarg); + break; + case OPTION_TRACE_CONFIG: { + auto trace_config_setting = ParseTraceConfigOption(optarg); + triton::server::TraceConfig& tc = + lparams + .trace_config_map_[std::get<0>(trace_config_setting).c_str()]; + tc.push_back(std::make_pair( + std::get<1>(trace_config_setting).c_str(), + std::get<2>(trace_config_setting).c_str())); + break; + } +#endif // TRITON_ENABLE_TRACING + + case OPTION_POLL_REPO_SECS: + lparams.repository_poll_secs_ = ParseOption(optarg); + break; + case OPTION_STARTUP_MODEL: + lparams.startup_models_.insert(optarg); + break; + case OPTION_MODEL_CONTROL_MODE: { + std::string mode_str(optarg); + std::transform( + mode_str.begin(), mode_str.end(), mode_str.begin(), ::tolower); + if (mode_str == "none") { + lparams.control_mode_ = TRITONSERVER_MODEL_CONTROL_NONE; + } else if (mode_str == "poll") { + lparams.control_mode_ = TRITONSERVER_MODEL_CONTROL_POLL; + } else if (mode_str == "explicit") { + lparams.control_mode_ = TRITONSERVER_MODEL_CONTROL_EXPLICIT; + } else { + throw ParseException("invalid argument for --model-control-mode"); + } + break; + } + case OPTION_RATE_LIMIT: { + std::string rate_limit_str(optarg); + std::transform( + rate_limit_str.begin(), rate_limit_str.end(), + rate_limit_str.begin(), ::tolower); + if (rate_limit_str == "execution_count") { + lparams.rate_limit_mode_ = TRITONSERVER_RATE_LIMIT_EXEC_COUNT; + } else if (rate_limit_str == "off") { + lparams.rate_limit_mode_ = TRITONSERVER_RATE_LIMIT_OFF; + } else { + throw ParseException("invalid argument for --rate-limit"); + } + break; + } + case OPTION_RATE_LIMIT_RESOURCE: { + std::string rate_limit_resource_str(optarg); + std::transform( + rate_limit_resource_str.begin(), 
rate_limit_resource_str.end(), + rate_limit_resource_str.begin(), ::tolower); + lparams.rate_limit_resources_.push_back( + ParseRateLimiterResourceOption(optarg)); + break; + } + case OPTION_PINNED_MEMORY_POOL_BYTE_SIZE: + lparams.pinned_memory_pool_byte_size_ = ParseOption(optarg); + break; + case OPTION_CUDA_MEMORY_POOL_BYTE_SIZE: + lparams.cuda_pools_.push_back( + ParsePairOption(optarg, ":")); + break; + case OPTION_CUDA_VIRTUAL_ADDRESS_SIZE: + lparams.cuda_virtual_address_size_.push_back( + ParsePairOption(optarg, ":")); + break; + case OPTION_RESPONSE_CACHE_BYTE_SIZE: { + cache_size_present = true; + const auto byte_size = std::to_string(ParseOption(optarg)); + lparams.cache_config_settings_["local"] = {{"size", byte_size}}; + std::cerr + << "Warning: '--response-cache-byte-size' has been deprecated! " + "This will default to the 'local' cache implementation with " + "the provided byte size for its config. Please use " + "'--cache-config' instead. The equivalent " + "--cache-config CLI args would be: " + "'--cache-config=local,size=" + + byte_size + "'" + << std::endl; + break; + } + case OPTION_CACHE_CONFIG: { + cache_config_present = true; + const auto cache_setting = ParseCacheConfigOption(optarg); + const auto& cache_name = std::get<0>(cache_setting); + const auto& key = std::get<1>(cache_setting); + const auto& value = std::get<2>(cache_setting); + lparams.cache_config_settings_[cache_name].push_back({key, value}); + break; + } + case OPTION_CACHE_DIR: + lparams.cache_dir_ = optarg; + break; + case OPTION_MIN_SUPPORTED_COMPUTE_CAPABILITY: + lparams.min_supported_compute_capability_ = + ParseOption(optarg); + break; + case OPTION_EXIT_TIMEOUT_SECS: + lparams.exit_timeout_secs_ = ParseOption(optarg); + break; + case OPTION_BACKEND_DIR: + lparams.backend_dir_ = optarg; + break; + case OPTION_REPOAGENT_DIR: + lparams.repoagent_dir_ = optarg; + break; + case OPTION_BUFFER_MANAGER_THREAD_COUNT: + lparams.buffer_manager_thread_count_ = ParseOption(optarg); + break; + case OPTION_MODEL_LOAD_THREAD_COUNT: + lparams.model_load_thread_count_ = ParseOption(optarg); + break; + case OPTION_BACKEND_CONFIG: + lparams.backend_config_settings_.push_back( + ParseBackendConfigOption(optarg)); + break; + case OPTION_HOST_POLICY: + lparams.host_policies_.push_back(ParseHostPolicyOption(optarg)); + break; + case OPTION_MODEL_LOAD_GPU_LIMIT: + lparams.load_gpu_limit_.emplace( + ParsePairOption(optarg, ":")); + break; + case OPTION_MODEL_NAMESPACING: + lparams.enable_model_namespacing_ = ParseOption(optarg); + break; + } + } + catch (const ParseException& pe) { + if ((pe.what() != NULL) && (strlen(pe.what()) != 0)) { + std::stringstream ss; + ss << "Bad option: \"--" << long_options[option_index].name << "\".\n" + << pe.what() << std::endl; + throw ParseException(ss.str()); + } else { + // In case of `Unrecognized option` or `Help` option, just throw a + // ParseException + throw ParseException(); + } + } + } + + if (optind < argc) { + throw ParseException(std::string("Unexpected argument: ") + argv[optind]); + } + + // + // Step 3. Post parsing validation, usually for options that depend on the + // others which are not determined until after parsing. 
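For the colon-separated options handled in the cases above, such as --cuda-memory-pool-byte-size, a self-contained sketch of what ParsePairOption does with a value like 0:67108864: split once on the delimiter and convert the two halves independently, rejecting input that has no delimiter. SplitDeviceBytePair is a hypothetical stand-in; the real helper is templated and throws ParseException with a similar message.

    // Hypothetical stand-in for a device-id/byte-size pair option parser;
    // not part of the patch.
    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <utility>

    std::pair<int, uint64_t>
    SplitDeviceBytePair(const std::string& arg, const std::string& delim)
    {
      const size_t pos = arg.find(delim);
      if (pos == std::string::npos) {
        throw std::runtime_error(
            "Cannot parse pair option -- argument requires format <first>" +
            delim + "<second>. Found: " + arg);
      }
      const int device_id = std::stoi(arg.substr(0, pos));
      const uint64_t byte_size = std::stoull(arg.substr(pos + delim.size()));
      return {device_id, byte_size};
    }

    int
    main()
    {
      // e.g. --cuda-memory-pool-byte-size=0:67108864  (GPU 0, 64 MB pool)
      const auto pool = SplitDeviceBytePair("0:67108864", ":");
      std::cout << "GPU " << pool.first << " -> " << pool.second << " bytes"
                << std::endl;
      return 0;
    }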
+ // + + if (lparams.control_mode_ != TRITONSERVER_MODEL_CONTROL_POLL) { + lparams.repository_poll_secs_ = 0; + } + +#ifdef TRITON_ENABLE_VERTEX_AI + // Set default model repository if specific flag is set, postpone the + // check to after parsing so we only monitor the default repository if + // Vertex service is allowed + if (lparams.model_repository_paths_.empty()) { + auto aip_storage_uri = + triton::server::GetEnvironmentVariableOrDefault("AIP_STORAGE_URI", ""); + if (!aip_storage_uri.empty()) { + lparams.model_repository_paths_.insert(aip_storage_uri); + } + } +#endif // TRITON_ENABLE_VERTEX_AI + +#ifdef TRITON_ENABLE_METRICS + lparams.allow_gpu_metrics_ &= lparams.allow_metrics_; + lparams.allow_cpu_metrics_ &= lparams.allow_metrics_; + // Set metrics_address to default if never specified + if (lparams.metrics_address_.empty()) { +#ifdef TRITON_ENABLE_HTTP + // If built with HTTP support, default to HTTP address + lparams.metrics_address_ = lparams.http_address_; +#else + // Otherwise have default for builds without HTTP support + lparams.metrics_address_ = "0.0.0.0"; +#endif // TRITON_ENABLE_HTTP + } +#endif // TRITON_ENABLE_METRICS + +#ifdef TRITON_ENABLE_TRACING + PostProcessTraceArgs( + lparams, trace_level_present, trace_rate_present, trace_count_present, + trace_filepath_present, trace_log_frequency_present, + explicit_disable_trace); +#endif // TRITON_ENABLE_TRACING + + // Check if there is a conflict between --disable-auto-complete-config + // and --strict-model-config + if (disable_auto_complete_config) { + if (strict_model_config_present && !lparams.strict_model_config_) { + std::cerr + << "Warning: Overriding deprecated '--strict-model-config' from " + "False to True in favor of '--disable-auto-complete-config'!" + << std::endl; + } + lparams.strict_model_config_ = true; + } + + // Check if there is a conflict between --response-cache-byte-size + // and --cache-config + if (cache_size_present && cache_config_present) { + throw ParseException( + "Error: Incompatible flags --response-cache-byte-size and " + "--cache-config both provided. Please provide one or the other."); + } + lparams.enable_cache_ = (cache_size_present || cache_config_present); + return {lparams, {}}; +} + +std::string +TritonParser::FormatUsageMessage(std::string str, int offset) +{ + int width = 60; + int current_pos = offset; + while (current_pos + width < int(str.length())) { + int n = str.rfind(' ', current_pos + width); + if (n != int(std::string::npos)) { + str.replace(n, 1, "\n\t"); + current_pos += (width + 9); + } + } + + return str; +} + +std::string +TritonParser::Usage() +{ + std::stringstream ss; + for (const auto& group : option_groups_) { + if (!group.first.empty() && !group.second.empty()) { + ss << std::endl << group.first << ":" << std::endl; + } + + for (const auto& o : group.second) { + if (!o.arg_desc_.empty()) { + ss << " --" << o.flag_ << " <" << o.arg_desc_ << ">" << std::endl + << "\t" << FormatUsageMessage(o.desc_, 0) << std::endl; + } else { + ss << " --" << o.flag_ << std::endl + << "\t" << FormatUsageMessage(o.desc_, 0) << std::endl; + } + } + } + return ss.str(); +} + +std::tuple +TritonParser::ParseMetricsConfigOption(const std::string& arg) +{ + // Format is "=" for generic configs/settings + int delim_setting = arg.find("="); + if (delim_setting < 0) { + std::stringstream ss; + ss << "--metrics-config option format is " + << "=. 
Got " << arg << std::endl; + throw ParseException(ss.str()); + } + + // Break section before "=" into substr to avoid matching commas + // in setting values. + auto name_substr = arg.substr(0, delim_setting); + int delim_name = name_substr.find(","); + + // No name-specific configs currently supported, though it may be in + // the future. Map global configs to empty string like other configs for + // now. + std::string name_string = std::string(); + if (delim_name >= 0) { + std::stringstream ss; + ss << "--metrics-config option format is " + << "=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } // else global metrics config + + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (setting_string.empty() || value_string.empty()) { + std::stringstream ss; + ss << "--metrics-config option format is " + << "=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } + + return {name_string, setting_string, value_string}; +} + +std::tuple +TritonParser::ParseCacheConfigOption(const std::string& arg) +{ + // Format is ",=" for specific + // config/settings and "=" for cache agnostic + // configs/settings + int delim_name = arg.find(","); + int delim_setting = arg.find("=", delim_name + 1); + + std::string name_string = std::string(); + if (delim_name > 0) { + name_string = arg.substr(0, delim_name); + } + // No cache-agnostic global settings are currently supported + else { + std::stringstream ss; + ss << "No cache specified. --cache-config option format is " + << ",=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } + + if (delim_setting < 0) { + std::stringstream ss; + ss << "--cache-config option format is ',='. Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (setting_string.empty() || value_string.empty()) { + std::stringstream ss; + ss << "--cache-config option format is ',='. Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + + return {name_string, setting_string, value_string}; +} + +std::tuple +TritonParser::ParseRateLimiterResourceOption(const std::string& arg) +{ + std::string error_string( + "--rate-limit-resource option format is " + "'::' or ':'. 
" + "Got " + + arg); + + std::string name_string(""); + int count = -1; + int device_id = -1; + + size_t delim_first = arg.find(":"); + size_t delim_second = arg.find(":", delim_first + 1); + + if (delim_second != std::string::npos) { + // Handle format `::' + size_t delim_third = arg.find(":", delim_second + 1); + if (delim_third != std::string::npos) { + throw ParseException(error_string); + } + name_string = arg.substr(0, delim_first); + count = ParseOption( + arg.substr(delim_first + 1, delim_second - delim_first - 1)); + device_id = ParseOption(arg.substr(delim_second + 1)); + } else if (delim_first != std::string::npos) { + // Handle format `:' + name_string = arg.substr(0, delim_first); + count = ParseOption(arg.substr(delim_first + 1)); + } else { + // If no colons found + throw ParseException(error_string); + } + + return {name_string, count, device_id}; +} + +std::tuple +TritonParser::ParseBackendConfigOption(const std::string& arg) +{ + // Format is ",=" for specific + // config/settings and "=" for backend agnostic + // configs/settings + int delim_name = arg.find(","); + int delim_setting = arg.find("=", delim_name + 1); + + std::string name_string = std::string(); + if (delim_name > 0) { + name_string = arg.substr(0, delim_name); + } else if (delim_name == 0) { + std::stringstream ss; + ss << "No backend specified. --backend-config option format is " + << ",= or " + << "=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } // else global backend config + + if (delim_setting < 0) { + std::stringstream ss; + ss << "--backend-config option format is ',='. Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (setting_string.empty() || value_string.empty()) { + std::stringstream ss; + ss << "--backend-config option format is ',='. 
Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + + return {name_string, setting_string, value_string}; +} + +void +TritonParser::ParseRestrictedFeatureOption( + const std::string& arg, const std::string& option_name, + const std::string& key_prefix, const std::string& feature_type, + RestrictedFeatures& restricted_features) +{ + const auto& parsed_tuple = + ParseGenericConfigOption(arg, ":", "=", option_name, "config name"); + + const auto& features = SplitOptions(std::get<0>(parsed_tuple), ","); + const auto& key = std::get<1>(parsed_tuple); + const auto& value = std::get<2>(parsed_tuple); + + for (const auto& feature : features) { + const auto& category = RestrictedFeatures::ToCategory(feature); + + if (category == RestrictedCategory::INVALID) { + std::stringstream ss; + ss << "unknown restricted " << feature_type << " '" << feature << "' " + << std::endl; + throw ParseException(ss.str()); + } + + if (restricted_features.IsRestricted(category)) { + // restricted feature can only be in one group + std::stringstream ss; + ss << "restricted " << feature_type << " '" << feature + << "' can not be specified in multiple config groups" << std::endl; + throw ParseException(ss.str()); + } + restricted_features.Insert( + category, std::make_pair(key_prefix + key, value)); + } +} + +std::tuple +TritonParser::ParseHostPolicyOption(const std::string& arg) +{ + return ParseGenericConfigOption(arg, ",", "=", "host-policy", "policy name"); +} + +std::tuple +TritonParser::ParseGenericConfigOption( + const std::string& arg, const std::string& first_delim, + const std::string& second_delim, const std::string& option_name, + const std::string& config_name) +{ + // Format is ",=" + int delim_name = arg.find(first_delim); + int delim_setting = arg.find(second_delim, delim_name + 1); + + std::string error_string = "--" + option_name + " option format is '<" + + config_name + ">" + first_delim + "" + + second_delim + "'. Got " + arg + "\n"; + + // Check for 2 semicolons + if ((delim_name < 0) || (delim_setting < 0)) { + throw ParseException(error_string); + } + + std::string name_string = arg.substr(0, delim_name); + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (name_string.empty() || setting_string.empty() || value_string.empty()) { + throw ParseException(error_string); + } + + return {name_string, setting_string, value_string}; +} + +#ifdef TRITON_ENABLE_TRACING +TRITONSERVER_InferenceTraceLevel +TritonParser::ParseTraceLevelOption(std::string arg) +{ + std::transform(arg.begin(), arg.end(), arg.begin(), [](unsigned char c) { + return std::tolower(c); + }); + + if ((arg == "false") || (arg == "off")) { + return TRITONSERVER_TRACE_LEVEL_DISABLED; + } + if ((arg == "true") || (arg == "on") || (arg == "min") || (arg == "max") || + (arg == "timestamps")) { + return TRITONSERVER_TRACE_LEVEL_TIMESTAMPS; + } + if (arg == "tensors") { + return TRITONSERVER_TRACE_LEVEL_TENSORS; + } + + throw ParseException("invalid value for trace level option: " + arg); +} + +InferenceTraceMode +TritonParser::ParseTraceModeOption(std::string arg) +{ + std::transform(arg.begin(), arg.end(), arg.begin(), [](unsigned char c) { + return std::tolower(c); + }); + + if (arg == "triton") { + return TRACE_MODE_TRITON; + } + if (arg == "opentelemetry") { + return TRACE_MODE_OPENTELEMETRY; + } + + throw ParseException( + "invalid value for trace mode option: " + arg + + ". 
Available options are \"triton\" and \"opentelemetry\""); +} + +std::tuple +TritonParser::ParseTraceConfigOption(const std::string& arg) +{ + int delim_name = arg.find(","); + int delim_setting = arg.find("=", delim_name + 1); + + std::string name_string = std::string(); + if (delim_name > 0) { + name_string = + std::to_string(ParseTraceModeOption(arg.substr(0, delim_name))); + } else if (delim_name == 0) { + std::stringstream ss; + ss << "No trace mode specified. --trace-config option format is " + << ",= or " + << "=. Got " << arg << std::endl; + throw ParseException(ss.str()); + } // else global trace config + + if (delim_setting < 0) { + std::stringstream ss; + ss << "--trace-config option format is ',='. " + "Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + std::string setting_string = + arg.substr(delim_name + 1, delim_setting - delim_name - 1); + std::string value_string = arg.substr(delim_setting + 1); + + if (setting_string.empty() || value_string.empty()) { + std::stringstream ss; + ss << "--trace-config option format is ',='. " + "Got " + << arg << std::endl; + throw ParseException(ss.str()); + } + + return {name_string, setting_string, value_string}; +} + +void +TritonParser::SetGlobalTraceArgs( + TritonServerParameters& lparams, bool trace_level_present, + bool trace_rate_present, bool trace_count_present, + bool explicit_disable_trace) +{ + for (const auto& global_setting : lparams.trace_config_map_[""]) { + try { + if (global_setting.first == "rate") { + if (trace_rate_present) { + std::cerr << "Warning: Overriding deprecated '--trace-rate' " + "in favor of provided rate value in --trace-config!" + << std::endl; + } + lparams.trace_rate_ = ParseOption(global_setting.second); + } + if (global_setting.first == "level") { + if (trace_level_present) { + std::cerr << "Warning: Overriding deprecated '--trace-level' " + "in favor of provided level in --trace-config!" + << std::endl; + } + auto parsed_level_config = ParseTraceLevelOption(global_setting.second); + explicit_disable_trace |= + (parsed_level_config == TRITONSERVER_TRACE_LEVEL_DISABLED); + lparams.trace_level_ = static_cast( + lparams.trace_level_ | parsed_level_config); + } + if (global_setting.first == "mode") { + lparams.trace_mode_ = ParseTraceModeOption(global_setting.second); + } + if (global_setting.first == "count") { + if (trace_count_present) { + std::cerr << "Warning: Overriding deprecated '--trace-count' " + "in favor of provided count in --trace-config!" + << std::endl; + } + lparams.trace_count_ = ParseOption(global_setting.second); + } + } + catch (const ParseException& pe) { + std::stringstream ss; + ss << "Bad option: \"--trace-config " << global_setting.first << "\".\n" + << pe.what() << std::endl; + throw ParseException(ss.str()); + } + } +} + +void +TritonParser::SetTritonTraceArgs( + TritonServerParameters& lparams, bool trace_filepath_present, + bool trace_log_frequency_present) +{ + for (const auto& mode_setting : + lparams.trace_config_map_[std::to_string(TRACE_MODE_TRITON)]) { + try { + if (mode_setting.first == "file") { + if (trace_filepath_present) { + std::cerr << "Warning: Overriding deprecated '--trace-file' " + "in favor of provided file in --trace-config!" + << std::endl; + } + lparams.trace_filepath_ = mode_setting.second; + } else if (mode_setting.first == "log-frequency") { + if (trace_log_frequency_present) { + std::cerr << "Warning: Overriding deprecated '--trace-file' " + "in favor of provided file in --trace-config!" 
+ << std::endl; + } + lparams.trace_log_frequency_ = ParseOption(mode_setting.second); + } + } + catch (const ParseException& pe) { + std::stringstream ss; + ss << "Bad option: \"--trace-config triton," << mode_setting.first + << "\".\n" + << pe.what() << std::endl; + throw ParseException(ss.str()); + } + } +} + +void +TritonParser::VerifyOpentelemetryTraceArgs( + bool trace_filepath_present, bool trace_log_frequency_present) +{ + if (trace_filepath_present) { + std::cerr << "Warning: '--trace-file' is deprecated and will " + "be ignored with opentelemetry tracing mode. " + << std::endl; + } + if (trace_log_frequency_present) { + std::cerr << "Warning: '--trace-log-frequency' is deprecated " + "and will be ignored with opentelemetry tracing mode." + << std::endl; + } +} + +void +TritonParser::PostProcessTraceArgs( + TritonServerParameters& lparams, bool trace_level_present, + bool trace_rate_present, bool trace_count_present, + bool trace_filepath_present, bool trace_log_frequency_present, + bool explicit_disable_trace) +{ + SetGlobalTraceArgs( + lparams, trace_level_present, trace_rate_present, trace_count_present, + explicit_disable_trace); + + if (lparams.trace_mode_ == TRACE_MODE_OPENTELEMETRY) { + VerifyOpentelemetryTraceArgs( + trace_filepath_present, trace_log_frequency_present); + } else if (lparams.trace_mode_ == TRACE_MODE_TRITON) { + SetTritonTraceArgs( + lparams, trace_filepath_present, trace_log_frequency_present); + } + + if (explicit_disable_trace) { + lparams.trace_level_ = TRITONSERVER_TRACE_LEVEL_DISABLED; + } +} + +#endif // TRITON_ENABLE_TRACING +}} // namespace triton::server diff --git a/src/command_line_parser.h b/src/command_line_parser.h new file mode 100644 index 0000000000..ef562a3efb --- /dev/null +++ b/src/command_line_parser.h @@ -0,0 +1,345 @@ +// Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions +// are met: +// * Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// * Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// * Neither the name of NVIDIA CORPORATION nor the names of its +// contributors may be used to endorse or promote products derived +// from this software without specific prior written permission. +// +// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
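For reference, the deprecation warnings and PostProcessTraceArgs() handling above amount to the following migration from the legacy tracing flags to --trace-config; the angle-bracket placeholders are illustrative, reconstructed from the parsing logic rather than quoted from documentation:

    --trace-file <path>         ->  --trace-config triton,file=<path>
    --trace-level <level>       ->  --trace-config level=<level>
    --trace-rate <rate>         ->  --trace-config rate=<rate>
    --trace-count <count>       ->  --trace-config count=<count>
    --trace-log-frequency <n>   ->  --trace-config triton,log-frequency=<n>

When both forms are given, SetGlobalTraceArgs() and SetTritonTraceArgs() warn and apply the --trace-config value (trace levels are OR-ed together), and a level of "off"/"false" from either form sets explicit_disable_trace so that tracing ends up fully disabled.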
+// +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "restricted_features.h" +#include "triton/common/logging.h" +#include "triton/core/tritonserver.h" +#ifdef TRITON_ENABLE_GRPC +// To avoid ambiguous reference during build +// grpc headers should be imported first +// https://github.com/open-telemetry/opentelemetry-cpp/blob/main/examples/otlp/README.md#additional-notes-regarding-abseil-library +#include "grpc/grpc_server.h" +#endif // TRITON_ENABLE_GRPC +#if defined(TRITON_ENABLE_HTTP) || defined(TRITON_ENABLE_METRICS) +#include "http_server.h" +#endif // TRITON_ENABLE_HTTP || TRITON_ENABLE_METRICS +#ifdef TRITON_ENABLE_SAGEMAKER +#include "sagemaker_server.h" +#endif // TRITON_ENABLE_SAGEMAKER +#ifdef TRITON_ENABLE_VERTEX_AI +#include "vertex_ai_server.h" +#endif // TRITON_ENABLE_VERTEX_AI + +#ifndef _WIN32 +#include +#include +#else +// Minimum implementation of for Windows +#define required_argument 1 +#define no_argument 2 +struct option { + option(const char* name, int has_arg, int* flag, int val) + : name(name), has_arg(has_arg), flag(flag), val(val) + { + } + const char* name; + int has_arg; + int* flag; + int val; +}; +#endif +#ifdef TRITON_ENABLE_TRACING +#include "tracer.h" +#endif + + +namespace triton { namespace server { + +// Command-line options +struct Option { + static constexpr const char* ArgNone = ""; + static constexpr const char* ArgBool = "boolean"; + static constexpr const char* ArgFloat = "float"; + static constexpr const char* ArgInt = "integer"; + static constexpr const char* ArgStr = "string"; + + Option(int id, std::string flag, std::string arg_desc, std::string desc) + : id_(id), flag_(flag), arg_desc_(arg_desc), desc_(desc) + { + } + + struct option GetLongOption() const + { + struct option lo { + flag_.c_str(), (!arg_desc_.empty()) ? required_argument : no_argument, + nullptr, id_ + }; + return lo; + } + + const int id_; + const std::string flag_; + const std::string arg_desc_; + const std::string desc_; +}; + +struct TritonServerParameters { + std::string server_id_{"triton"}; + bool exit_on_error_{true}; + bool strict_model_config_{false}; + bool strict_readiness_{true}; + int32_t exit_timeout_secs_{30}; +#ifdef TRITON_ENABLE_GPU + double min_supported_compute_capability_{TRITON_MIN_COMPUTE_CAPABILITY}; +#else + double min_supported_compute_capability_{0.0}; +#endif // TRITON_ENABLE_GPU + std::string repoagent_dir_{"/opt/tritonserver/repoagents"}; + std::string backend_dir_{"/opt/tritonserver/backends"}; + std::vector> + backend_config_settings_; + + // Model repository manager configuration + bool enable_model_namespacing_{false}; + std::set model_repository_paths_{}; + TRITONSERVER_ModelControlMode control_mode_{TRITONSERVER_MODEL_CONTROL_NONE}; + std::set startup_models_{}; + // Interval, in seconds, when the model repository is polled for changes. + int32_t repository_poll_secs_{15}; + // Number of threads to use for concurrently loading models + uint32_t model_load_thread_count_{4}; + std::map load_gpu_limit_; + + // Rate limiter configuration + // FIXME: Once the rate limiter implementation is complete make + // EXEC_COUNT the default. 
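The rate limiter fields below are filled from the options parsed earlier. A hedged illustration of the accepted argument shapes, inferred from the OPTION_RATE_LIMIT case and ParseRateLimiterResourceOption() above rather than from separate documentation (the resource name "R1" is a made-up example); the commented-out initializer that follows records the intended future default:

    // --rate-limit execution_count   -> rate_limit_mode_ = TRITONSERVER_RATE_LIMIT_EXEC_COUNT
    // --rate-limit off               -> rate_limit_mode_ = TRITONSERVER_RATE_LIMIT_OFF (current default)
    // --rate-limit-resource R1:4     -> {"R1", 4, -1}  (device id recorded as -1 when omitted)
    // --rate-limit-resource R1:4:0   -> {"R1", 4, 0}   (resource R1, count 4, device 0)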
+ // TRITONSERVER_RateLimitMode + // rate_limit_mode_{TRITONSERVER_RATE_LIMIT_EXEC_COUNT}; + TRITONSERVER_RateLimitMode rate_limit_mode_{TRITONSERVER_RATE_LIMIT_OFF}; + std::vector> rate_limit_resources_; + + // memory pool configuration + int64_t pinned_memory_pool_byte_size_{1 << 28}; + std::list> cuda_pools_; + std::list> cuda_virtual_address_size_; + + // [FIXME] this option is broken after backend separation: this should have + // controlled backend copy behavior but not properly propagate to backend + // after separation, need to go through backend config. + int32_t buffer_manager_thread_count_{0}; + + std::vector> host_policies_; + + // Cache configuration + bool enable_cache_{false}; + std::string cache_dir_{"/opt/tritonserver/caches"}; + std::unordered_map< + std::string, std::vector>> + cache_config_settings_; + +#ifdef TRITON_ENABLE_LOGGING + bool log_info_{true}; + bool log_warn_{true}; + bool log_error_{true}; + int32_t log_verbose_{0}; + triton::common::Logger::Format log_format_{ + triton::common::Logger::Format::kDEFAULT}; + std::string log_file_{}; +#endif // TRITON_ENABLE_LOGGING + +#ifdef TRITON_ENABLE_TRACING + std::string trace_filepath_{}; + TRITONSERVER_InferenceTraceLevel trace_level_{ + TRITONSERVER_TRACE_LEVEL_DISABLED}; + int32_t trace_rate_{1000}; + int32_t trace_count_{-1}; + int32_t trace_log_frequency_{0}; + InferenceTraceMode trace_mode_{TRACE_MODE_TRITON}; + TraceConfigMap trace_config_map_; +#endif // TRITON_ENABLE_TRACING + +// The configurations for various endpoints (i.e. HTTP, GRPC and metrics) +#ifdef TRITON_ENABLE_HTTP + bool allow_http_{true}; + std::string http_address_{"0.0.0.0"}; + int32_t http_port_{8000}; + bool reuse_http_port_{false}; + std::string http_forward_header_pattern_; + // The number of threads to initialize for the HTTP front-end. + int http_thread_cnt_{8}; + RestrictedFeatures http_restricted_apis_{}; +#endif // TRITON_ENABLE_HTTP + +#ifdef TRITON_ENABLE_GRPC + bool allow_grpc_{true}; + triton::server::grpc::Options grpc_options_; +#endif // TRITON_ENABLE_GRPC + +#ifdef TRITON_ENABLE_METRICS + bool allow_metrics_{true}; + // Defaults to http_address_ if TRITON_ENABLE_HTTP is enabled for backwards, + // otherwise defaults to "0.0.0.0" for TRITON_ENABLE_HTTP is disabled. + std::string metrics_address_{""}; + int32_t metrics_port_{8002}; + // Metric settings for Triton core + float metrics_interval_ms_{2000}; + bool allow_gpu_metrics_{true}; + bool allow_cpu_metrics_{true}; + std::vector> + metrics_config_settings_; +#endif // TRITON_ENABLE_METRICS + +#ifdef TRITON_ENABLE_SAGEMAKER + bool allow_sagemaker_{false}; + std::string sagemaker_address_{"0.0.0.0"}; + int32_t sagemaker_port_{8080}; + bool sagemaker_safe_range_set_{false}; + std::pair sagemaker_safe_range_{-1, -1}; + // The number of threads to initialize for the SageMaker HTTP front-end. + int sagemaker_thread_cnt_{8}; +#endif // TRITON_ENABLE_SAGEMAKER + +#ifdef TRITON_ENABLE_VERTEX_AI + bool allow_vertex_ai_{false}; + std::string vertex_ai_address_{"0.0.0.0"}; + int32_t vertex_ai_port_{8080}; + // The number of threads to initialize for the Vertex AI HTTP front-end. + int vertex_ai_thread_cnt_{8}; + std::string vertex_ai_default_model_{}; +#endif // TRITON_ENABLE_VERTEX_AI + + // [FIXME] who should call this function? 
+ void CheckPortCollision(); + using ManagedTritonServerOptionPtr = std::unique_ptr< + TRITONSERVER_ServerOptions, decltype(&TRITONSERVER_ServerOptionsDelete)>; + ManagedTritonServerOptionPtr BuildTritonServerOptions(); +}; + +// Exception type to be thrown if the error is parsing related +class ParseException : public std::exception { + public: + ParseException() = default; + ParseException(const std::string& message) : message_(message) {} + + virtual const char* what() const throw() { return message_.c_str(); } + + private: + const std::string message_{""}; +}; + +// [WIP] Fall-through parser, Parse() will convert the recognized options into +// parameter object and return the unrecognized options to be another argument +// list for other parser to consume. +// This allows the composition of parser chain. +// [FIXME] abstract interface, concrete class below should only parse Triton +// core and endpoint control options (endpoint specific options in their own +// parser) +class TritonParser { + public: + TritonParser(); + // Parse command line arguments into a parameters struct and transform + // the argument list to contain only unrecognized options. The content of + // unrecognized argument list shares the same lifecycle as 'argv'. + // Raise ParseException if fail to parse recognized options. + std::pair> Parse( + int argc, char** argv); + + // Return usage of all recognized options + std::string Usage(); + + private: + std::string FormatUsageMessage(std::string str, int offset); + // Helper functions for parsing options that require multi-value parsing. + std::tuple ParseCacheConfigOption( + const std::string& arg); + std::tuple ParseRateLimiterResourceOption( + const std::string& arg); + std::tuple ParseBackendConfigOption( + const std::string& arg); + std::tuple ParseHostPolicyOption( + const std::string& arg); + std::tuple ParseMetricsConfigOption( + const std::string& arg); + void ParseRestrictedFeatureOption( + const std::string& arg, const std::string& option_name, + const std::string& header_prefix, const std::string& feature_type, + RestrictedFeatures& restricted_features); +#ifdef TRITON_ENABLE_TRACING + TRITONSERVER_InferenceTraceLevel ParseTraceLevelOption(std::string arg); + InferenceTraceMode ParseTraceModeOption(std::string arg); + std::tuple ParseTraceConfigOption( + const std::string& arg); + // Helper functions for post processing for collected trace arguments. 
+ void SetGlobalTraceArgs( + TritonServerParameters& lparams, bool trace_level_present, + bool trace_rate_present, bool trace_count_present, + bool explicit_disable_trace); + void SetTritonTraceArgs( + TritonServerParameters& lparams, bool trace_filepath_present, + bool trace_log_frequency_present); + void VerifyOpentelemetryTraceArgs( + bool trace_filepath_present, bool trace_log_frequency_present); + void PostProcessTraceArgs( + TritonServerParameters& lparams, bool trace_level_present, + bool trace_rate_present, bool trace_count_present, + bool trace_filepath_present, bool trace_log_frequency_present, + bool explicit_disable_trace); +#endif // TRITON_ENABLE_TRACING + // Helper function to parse option in + // "[1st_delim][2nd_delim]" format + std::tuple ParseGenericConfigOption( + const std::string& arg, const std::string& first_delim, + const std::string& second_delim, const std::string& option_name, + const std::string& config_name); + + // Initialize individual option groups + void SetupOptions(); + // Initialize option group mappings + void SetupOptionGroups(); + + // Sum of option groups: vector to maintain insertion order for Usage() + std::vector&>> option_groups_; + // Individual option groups + std::vector