Add testing for Pytorch instance group kind MODEL #5810

Merged: krishung5 merged 9 commits into main from krish-pytorch on Jun 13, 2023

Conversation

@krishung5 krishung5 commented May 18, 2023

@krishung5 (Contributor Author)

Unfortunately torch._C._cuda_sleep is not supported in TorchScript:

Python builtin <built-in function _cuda_sleep> is currently not supported in Torchscript:
  File "/opt/tritonserver/qa/L0_libtorch_instance_group_kind_model/gen_models.py", line 37
    def forward(self, x):
        torch._C._cuda_sleep(10)
        ~~~~~~~~~~~~~~~~~~~~ <--- HERE
        return torch.sum(x, dim=1)

I couldn't find another way to make a torch model sleep while occupying the GPU, either.

@Tabrizian (Member) left a comment

The testing looks good. Could you also add the same model to L0_io?

@@ -47,13 +47,11 @@ MODELSDIR=`pwd`/models
DATADIR=/data/inferenceserver/${REPO_VERSION}/qa_model_repository
ENSEMBLEDIR=/data/inferenceserver/${REPO_VERSION}/qa_ensemble_model_repository/qa_model_repository

export CUDA_VISIBLE_DEVICES=0,1
@krishung5 (Contributor Author)

@Tabrizian This line was the reason for the strange behavior we saw. I removed it, since we already export the CUDA devices at line 41 above.
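
For context on why a stray export can matter: CUDA_VISIBLE_DEVICES controls which physical GPUs a process can see and renumbers them from zero, so a second export later in a script overrides the first. The thread does not describe the strange behavior itself, so the following is only a minimal sketch of the remapping rule. `visible_devices` is a hypothetical helper, and it assumes integer entries only (no GPU UUIDs).

```python
def visible_devices(env_value):
    """Map logical CUDA device indices to physical GPU indices.

    Simplified model of CUDA_VISIBLE_DEVICES: only integer entries are
    handled, and enumeration stops at the first invalid or negative entry.
    """
    mapping = {}
    for entry in env_value.split(","):
        entry = entry.strip()
        if not entry.isdigit():
            # An invalid entry hides itself and every entry after it.
            break
        # The next logical index is simply the number of devices seen so far.
        mapping[len(mapping)] = int(entry)
    return mapping


# With "0,1" only two logical devices exist, so a model that asks for
# logical device 2 or 3 has nothing to run on.
print(visible_devices("0,1"))
# With "0,2" logical device 1 is physical GPU 2.
print(visible_devices("0,2"))
```

Under this rule, a model configured for device 2 only works if the last export before the server starts exposes at least three devices.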

@krishung5 krishung5 force-pushed the krish-pytorch branch 2 times, most recently from 81a6593 to 4765886 Compare June 9, 2023 21:57
@krishung5 (Contributor Author)

@Tabrizian Please let me know if the testing looks good or if there is any other test case we need to cover, thank you!

        self.device = device

    def forward(self, INPUT0, INPUT1):
        INPUT0 = INPUT0.to(self.device)
@tanmayv25 (Contributor) Jun 12, 2023

Does this ensure that execution also occurs on the same device? Is it possible for the execution kernels to be launched on the default device and access the tensors via P2P?
This is fine for now, but can you verify with Nsight traces that kernels are being launched on both devices?

@krishung5 (Contributor Author) Jun 12, 2023

I ran Perf Analyzer (PA) on the model that uses device 0 and device 2, and the nvidia-smi output showed that only those two devices were utilized:

krish@nvdl-a112-asus01:/$ nvidia-smi
Mon Jun 12 13:56:05 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:01:00.0 Off |                    0 |
|  0%   46C    P0    81W / 300W |    730MiB / 46068MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:41:00.0 Off |                    0 |
|  0%   45C    P0    79W / 300W |    728MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:81:00.0 Off |                    0 |
|  0%   44C    P0    78W / 300W |    730MiB / 46068MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:C1:00.0 Off |                    0 |
|  0%   45C    P0    80W / 300W |    730MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11479      C   tritonserver                      728MiB |
|    1   N/A  N/A     11479      C   tritonserver                      726MiB |
|    2   N/A  N/A     11479      C   tritonserver                      728MiB |
|    3   N/A  N/A     11479      C   tritonserver                      728MiB |
+-----------------------------------------------------------------------------+

> Is it possible for the execution kernels to be launched on the default device and access the tensors via P2P?

I think the default device is probably used for accessing the tensors. We can see that the default device, GPU 0, has higher utilization. Also, for a model that uses the CPU and device 3, device 0 is used as well:

krish@nvdl-a112-asus01:/$ nvidia-smi
Mon Jun 12 13:58:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:01:00.0 Off |                    0 |
|  0%   47C    P0    81W / 300W |    730MiB / 46068MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:41:00.0 Off |                    0 |
|  0%   46C    P0    80W / 300W |    728MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:81:00.0 Off |                    0 |
|  0%   45C    P0    78W / 300W |    730MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:C1:00.0 Off |                    0 |
|  0%   46C    P0    81W / 300W |    730MiB / 46068MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11479      C   tritonserver                      728MiB |
|    1   N/A  N/A     11479      C   tritonserver                      726MiB |
|    2   N/A  N/A     11479      C   tritonserver                      728MiB |
|    3   N/A  N/A     11479      C   tritonserver                      728MiB |
+-----------------------------------------------------------------------------+

Let me try to get the Nsight traces. I had an issue with Nsight hanging while generating the report.
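
The utilization readings in the two nvidia-smi tables above can also be extracted programmatically rather than read by eye. This is a minimal sketch, assuming the default table layout shown in this thread. `gpu_utilization` and the regular expressions are illustrative and not part of the test suite, and `SAMPLE` is an excerpt of the first table.

```python
import re

# Row that introduces a GPU, e.g. "|   0  NVIDIA A40          On   | ..."
GPU_ROW = re.compile(r"^\|\s+(\d+)\s+\S")
# Status row, e.g. "| ... |    730MiB / 46068MiB |      3%      Default |"
STATUS_ROW = re.compile(r"\|\s*(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%\s+\S+\s*\|")


def gpu_utilization(smi_text):
    """Return {gpu_index: utilization_percent} parsed from nvidia-smi text."""
    util = {}
    current = None
    for line in smi_text.splitlines():
        match = GPU_ROW.match(line)
        # Process-table rows also start with a GPU index, but they contain
        # a memory figure ("728MiB"), so they are skipped here.
        if match and "MiB" not in line:
            current = int(match.group(1))
            continue
        match = STATUS_ROW.search(line)
        if match and current is not None:
            util[current] = int(match.group(3))
            current = None
    return util


SAMPLE = """\
|   0  NVIDIA A40          On   | 00000000:01:00.0 Off |                    0 |
|  0%   46C    P0    81W / 300W |    730MiB / 46068MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:41:00.0 Off |                    0 |
|  0%   45C    P0    79W / 300W |    728MiB / 46068MiB |      0%      Default |
"""

util = gpu_utilization(SAMPLE)
busy = [index for index, percent in util.items() if percent > 0]
print(busy)  # [0]
```

A single nvidia-smi snapshot is a coarse signal, which is why the review still asked for Nsight traces to confirm where the kernels were launched.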

@krishung5 (Contributor Author)

I resolved the Nsight hanging issue and finally got the traces. I used the libtorch_multi_gpu test model, which uses device 0 and device 2. The traces show that kernels were launched on both device 0 and device 2.
[Image: Nsight trace]

Contributor

Thanks for confirming, Kris!

@krishung5 krishung5 merged commit cfa1efe into main Jun 13, 2023
@krishung5 krishung5 deleted the krish-pytorch branch June 13, 2023 05:43
krishung5 added a commit that referenced this pull request Jun 13, 2023
* Add testing for Pytorch instance group kind MODEL

* Remove unused item

* Update testing to verify the infer result

* Add copyright

* Remove unused import

* Update pip install

* Update the model to use the same add sub logic

* Add torch multi-gpu and multi-device models to L0_io

* Fix up model version
mc-nv pushed a commit that referenced this pull request Jun 13, 2023
suraj-vathsa added a commit to verkada/triton-inference-server that referenced this pull request Dec 15, 2023
* Changed copyright (triton-inference-server#5705)

* Modify timeout test in L0_sequence_batcher to use portable backend (triton-inference-server#5696)

* Modify timeout test in L0_sequence_batcher to use portable backend

* Use identity backend that is built by default on Windows

* updated upstream container name (triton-inference-server#5713)

* Fix triton container version (triton-inference-server#5714)

* Update the L0_model_config test expected error message (triton-inference-server#5684)

* Use better value in timeout test L0_sequence_batcher (triton-inference-server#5716)

* Use better value in timeout test L0_sequence_batcher

* Format

* Update JAX install (triton-inference-server#5613)

* Add notes about socket usage to L0_client_memory_growth test (triton-inference-server#5710)

* Check TensorRT error message more granularly (triton-inference-server#5719)

* Check TRT err msg more granularly

* Clarify source of error messages

* Consolidate tests for message parts

* Pin Python Package Versions for HTML Document Generation (triton-inference-server#5727)

* updating with pinned versions for python dependencies

* updated with pinned sphinx and nbclient versions

* Test full error returned when custom batcher init fails (triton-inference-server#5729)

* Add testing for batcher init failure, add wait for status check

* Formatting

* Change search string

* Add fastertransformer test  (triton-inference-server#5500)

Add fastertransformer test that uses 1GPU.

* Fix L0_backend_python on Jetson  (triton-inference-server#5728)

* Don't use mem probe in Jetson

* Clarify failure messages in L0_backend_python

* Update copyright

* Add JIRA ref, fix _test_jetson

* Add testing for Python custom metrics API (triton-inference-server#5669)

* Add testing for python custom metrics API

* Add custom metrics example to the test

* Fix for CodeQL report

* Fix test name

* Address comment

* Add logger and change the enum usage

* Add testing for Triton Client Plugin API (triton-inference-server#5706)

* Add HTTP client plugin test

* Add testing for HTTP asyncio

* Add async plugin support

* Fix qa container for L0_grpc

* Add testing for grpc client plugin

* Remove unused imports

* Fix up

* Fix L0_grpc models QA folder

* Update the test based on review feedback

* Remove unused import

* Add testing for .plugin method

* Install jemalloc (triton-inference-server#5738)

* Add --metrics-address and testing (triton-inference-server#5737)

* Add --metrics-address, add tests to L0_socket, re-order CLI options for consistency

* Use non-localhost address

* Add testing for basic auth plugin for HTTP/gRPC clients (triton-inference-server#5739)

* Add HTTP basic auth test

* Add testing for gRPC basic auth

* Fix up

* Remove unused imports

* Add multi-gpu, multi-stream testing for dlpack tensors (triton-inference-server#5550)

* Add multi-gpu, multi-stream testing for dlpack tensors

* Update note on SageMaker MME support for ensemble (triton-inference-server#5723)

* Run L0_backend_python subtests with virtual environment (triton-inference-server#5753)

* Update 'main' to track development of 2.35.0 / r23.06 (triton-inference-server#5764)

* Include jemalloc into the documentation (triton-inference-server#5760)

* Enhance tests in L0_model_update (triton-inference-server#5724)

* Add model instance name update test

* Add gap for timestamp to update

* Add some tests with dynamic batching

* Extend supported test on rate limit off

* Continue test if off mode failed

* Fix L0_memory_growth (triton-inference-server#5795)

(1) reduce MAX_ALLOWED_ALLOC to be more strict for bounded tests, and generous for unbounded tests.
(2) allow unstable measurement from PA.
(3) improve logging for future triage

* Add note on --metrics-address (triton-inference-server#5800)

* Add note on --metrics-address

* Copyright

* Minor fix for running "mlflow deployments create -t triton --flavor triton ..." (triton-inference-server#5658)

UnboundLocalError: local variable 'meta_dict' referenced before assignment

The above error shows in listing models in Triton model repository

* Adding test for new sequence mode (triton-inference-server#5771)

* Adding test for new sequence mode

* Update option name

* Clean up testing spacing and new lines

* MLFlow Triton Plugin: Add support for s3 prefix and custom endpoint URL (triton-inference-server#5686)

* MLFlow Triton Plugin: Add support for s3 prefix and custom endpoint URL

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Update the function order of config.py and use os.path.join to replace filtering a list of strings then joining

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Update onnx flavor to support s3 prefix and custom endpoint URL

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Fix two typos in MLFlow Triton plugin README.md

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments (replace => strip)

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments (init regex only for s3)

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Remove unused local variable: slash_locations

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

---------

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Fix client script (triton-inference-server#5806)

* Add MLFlow test for already loaded models. Update copyright year (triton-inference-server#5808)

* Use the correct gtest filter (triton-inference-server#5824)

* Add error message test on S3 access decline (triton-inference-server#5825)

* Add test on access decline

* Fix typo

* Add MinIO S3 access decline test

* Make sure bucket exists during access decline test

* Restore AWS_SECRET_ACCESS_KEY on S3 local test (triton-inference-server#5832)

* Restore AWS_SECRET_ACCESS_KEY

* Add reason for restoring keys

* nnshah1 stream infer segfault fix (triton-inference-server#5842)

match logic from infer_handler.cc

* Remove unused test (triton-inference-server#5851)

* Add and document memory usage in statistic protocol (triton-inference-server#5642)

* Add and document memory usage in statistic protocol

* Fix doc

* Fix up

* [DO NOT MERGE Add test. FIXME: model generation

* Fix up

* Fix style

* Address comment

* Fix up

* Set memory tracker backend option in build.py

* Fix up

* Add CUPTI library in Windows image build

* Add note to build with memory tracker by default

* use correct lib dir on CentOS (triton-inference-server#5836)

* use correct lib dir on CentOS

* use new location for opentelemetry-cpp

* Document that gpu-base flag is optional for cpu-only builds (triton-inference-server#5861)

* Update Jetson tests in Docker container (triton-inference-server#5734)

* Add flags for ORT build

* Separate list with commas

* Remove unnecessary detection of nvcc compiler

* Fixed Jetson path for perf_client, datadir

* Create version directoryy for custom model

* Remove probe check for shm, add shm exceed error for Jetson

* Copyright updates, fix Jetson Probe

* Fix be_python test num on Jetson

* Remove extra comma, non-Dockerized Jetson comment

* Remove comment about Jetson being non-dockerized

* Remove no longer needed flag

* Update `main` post-23.05 release (triton-inference-server#5880)

* Update README and versions for 23.05 branch

* Changes to support 23.05 (triton-inference-server#5782)

* Update python and conda version

* Update CMAKE installation

* Update checksum version

* Update ubuntu base image to 22.04

* Use ORT 1.15.0

* Set CMAKE to pull latest version

* Update libre package version

* Removing unused argument

* Adding condition for ubuntu 22.04

* Removing installation of the package from the devel container

* Nnshah1 u22.04 (triton-inference-server#5770)

* Update CMAKE installation

* Update python and conda version

* Update CMAKE installation

* Update checksum version

* Update ubuntu base image to 22.04

* updating versions for ubuntu 22.04

* remove re2

---------

Co-authored-by: Neelay Shah <neelays@neelays-dt.nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>

* Set ONNX version to 1.13.0

* Fix L0_custom_ops for ubuntu 22.04 (triton-inference-server#5775)

* add back rapidjson-dev

---------

Co-authored-by: Neelay Shah <neelays@neelays-dt.nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com>

* Fix L0_mlflow (triton-inference-server#5805)

* working thread

* remove default install of blinker

* merge issue fixed

* Fix L0_backend_python/env test (triton-inference-server#5799)

* Fix L0_backend_python/env test

* Address comment

* Update the copyright

* Fix up

* Fix L0_http_fuzz (triton-inference-server#5776)

* installing python 3.8.16 for test

* spelling

Co-authored-by: Neelay Shah <neelays@nvidia.com>

* use util functions to install python3.8 in an easier way

---------

Co-authored-by: Neelay Shah <neelays@nvidia.com>

* Update Windows versions for 23.05 release (triton-inference-server#5826)

* Rename Ubuntu 20.04 mentions to 22.04 (triton-inference-server#5849)

* Update DCGM version (triton-inference-server#5856)

* Update DCGM version (triton-inference-server#5857)

* downgrade DCGM version to 2.4.7 (triton-inference-server#5860)

* Updating link for latest release notes to 23.05

---------

Co-authored-by: Neelay Shah <neelays@neelays-dt.nvidia.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>

* Disable memory tracker on Jetpack until the library is available (triton-inference-server#5882)

* Fix datadir for x86 (triton-inference-server#5894)

* Add more test on instance signature (triton-inference-server#5852)

* Add testing for new error handling API (triton-inference-server#5892)

* Test batch input for libtorch (triton-inference-server#5855)

* Draft ragged TensorRT unit model gen

* Draft libtorch special identity model

* Autoformat

* Update test, fix ragged model gen

* Update suffix for io for libtorch

* Remove unused variables

* Fix io names for libtorch

* Use INPUT0/OUTPUT0 for libtorch

* Reorder to match test model configs

* Remove unnecessary capitalization

* Auto-format

* Capitalization is necessary

* Remove unnecessary export

* Clean up Azure dependency in server build (triton-inference-server#5900)

* [DO NOT MERGE]

* Remove Azure dependency in server component build

* Finalize

* Fix dependency

* Fixing up

* Clean up

* Add response parameters for streaming GRPC inference to enhance decoupled support (triton-inference-server#5878)

* Update 'main' to track development of 2.36.0 / 23.07 (triton-inference-server#5917)

* Add test for detecting S3 http2 upgrade request (triton-inference-server#5911)

* Add test for detecting S3 http2 upgrade request

* Enhance testing

* Copyright year update

* Add Redis cache build, tests, and docs (triton-inference-server#5916)

* Updated handling for uint64 request priority

* Ensure HPCX dependencies found in container (triton-inference-server#5922)

* Add HPCX dependencies to search path

* Copy hpcx to CPU-only container

* Add ucc path to CPU-only image

* Fixed if statement

* Fix df variable

* Combine hpcx LD_LIBRARY_PATH

* Add test case where MetricFamily is deleted before deleting Metric (triton-inference-server#5915)

* Add test case for metric lifetime error handling

* Address comment

* Use different MetricFamily name

* Add testing for Pytorch instance group kind MODEL (triton-inference-server#5810)

* Add testing for Pytorch instance group kind MODEL

* Remove unused item

* Update testing to verify the infer result

* Add copyright

* Remove unused import

* Update pip install

* Update the model to use the same add sub logic

* Add torch multi-gpu and multi-device models to L0_io

* Fix up model version

* Add test for sending instance update config via load API (triton-inference-server#5937)

* Add test for passing config via load api

* Add more docs on instance update behavior

* Update to suggested docs

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Use dictionary for json config

* Modify the config fetched from Triton instead

---------

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Fix L0_batcher count check (triton-inference-server#5939)

* Add testing for json tensor format (triton-inference-server#5914)

* Add redis config and use local logfile for redis server (triton-inference-server#5945)

* Add redis config and use local logfile for redis server

* Move redis log config to CLI

* Have separate redis logs for unit tests and CLI tests

* Add test on rate limiter max resource decrease update (triton-inference-server#5885)

* Add test on rate limiter max resource decrease update

* Add test with explicit resource

* Check server log for decreased resource limit

* Add docs on decoupled final response feature (triton-inference-server#5936)

* Allow changing ping behavior based on env variable in SageMaker and entrypoint updates (triton-inference-server#5910)

* Allow changing ping behavior based on env variable in SageMaker

* Add option for additional args

* Make ping further configurable

* Allow further configuration of grpc and http ports

* Update docker/sagemaker/serve

* Update docker/sagemaker/serve

---------

Co-authored-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>

* Remove only MPI libraries in HPCX in L0_perf_analyzer (triton-inference-server#5967)

* Be more specific with MPI removal

* Delete all libmpi libs

* Ensure L0_batch_input requests received in order (triton-inference-server#5963)

* Add print statements for debugging

* Add debugging print statements

* Test using grpc client with stream to fix race

* Use streaming client in all non-batch tests

* Switch all clients to streaming GRPC

* Remove unused imports, vars

* Address comments

* Remove random comment

* Set inputs as separate function

* Split set inputs based on test type

* Add test for redis cache auth credentials via env vars (triton-inference-server#5966)

* Auto-formatting (triton-inference-server#5979)

* Auto-format

* Change to clang-format-15 in CONTRIBTUING

* Adding tests ensuring locale setting is passed to python backend interpreter

* Refactor build.py CPU-only Linux libs for readability (triton-inference-server#5990)

* Improve the error message when the number of GPUs is insufficient (triton-inference-server#5993)

* Update README to include CPP-API Java Bindings (triton-inference-server#5883)

* Update env variable to use for overriding /ping behavior (triton-inference-server#5994)

* Add test that >1000 model files can be loaded in S3 (triton-inference-server#5976)

* Add test for >1000 files

* Capitalization for consistency

* Add bucket cleaning at end

* Move test pass/fail to end

* Check number of files in model dir at load time

* Add testing for GPU tensor error handling (triton-inference-server#5871)

* Add testing for GPU tensor error handling

* Fix up

* Remove exit 0

* Fix jetson

* Fix up

* Add test for Python BLS model loading API (triton-inference-server#5980)

* Add test for Python BLS model loading API

* Fix up

* Update README and versions for 23.06 branch

* Fix LD_LIBRARY_PATH for PyTorch backend

* Return updated df in add_cpu_libs

* Remove unneeded df param

* Update test failure messages to match Dataloader changes (triton-inference-server#6006)

* Add dependency for L0_python_client_unit_tests (triton-inference-server#6010)

* Improve performance tuning guide (triton-inference-server#6026)

* Enabling nested spans for trace mode OpenTelemetry (triton-inference-server#5928)

* Adding nested spans to OTel tracing + support of ensemble models

* Move multi-GPU dlpack test to a separate L0 test (triton-inference-server#6001)

* Move multi-GPU dlpack test to a separate L0 test

* Fix copyright

* Fix up

* OpenVINO 2023.0.0 (triton-inference-server#6031)

* Upgrade OV to 2023.0.0

* Upgrade OV model gen script to 2023.0.0

* Add test to check the output memory type for onnx models (triton-inference-server#6033)

* Add test to check the output memory type for onnx models

* Remove unused import

* Address comment

* Add testing for implicit state for PyTorch backend (triton-inference-server#6016)

* Add testing for implicit state for PyTorch backend

* Add testing for libtorch string implicit models

* Fix CodeQL

* Mention that libtorch backend supports implicit state

* Fix CodeQL

* Review edits

* Fix output tests for PyTorch backend

* Allow uncompressed conda execution enviroments (triton-inference-server#6005)

Add test for uncompressed conda execution enviroments

* Fix implicit state test (triton-inference-server#6039)

* Adding target_compile_features cxx_std_17 to tracing lib (triton-inference-server#6040)

* Update 'main' to track development of 2.37.0 / 23.08

* Fix intermittent failure in L0_model_namespacing (triton-inference-server#6052)

* Fix PyTorch implicit model mounting in gen_qa_model_repository (triton-inference-server#6054)

* Fix broken links pointing to the `grpc_server.cc` file (triton-inference-server#6068)

* Fix L0_backend_python expected instance name (triton-inference-server#6073)

* Fix expected instance name

* Copyright year

* Fix L0_sdk: update the search name for the client wheel (triton-inference-server#6074)

* Fix name of client wheel to be looked for

* Fix up

* Add GitHub action to format and lint code (triton-inference-server#6022)

* Add pre-commit

* Fix typos, exec/shebang, formatting

* Remove clang-format

* Update contributing md to include pre-commit

* Update spacing in CONTRIBUTING

* Fix contributing pre-commit link

* Link to pre-commit install directions

* Wording

* Restore clang-format

* Fix yaml spacing

* Exclude templates folder for check-yaml

* Remove unused vars

* Normalize spacing

* Remove unused variable

* Normalize config indentation

* Update .clang-format to enforce max line length of 80

* Update copyrights

* Update copyrights

* Run workflows on every PR

* Fix copyright year

* Fix grammar

* Entrypoint.d files are not executable

* Run pre-commit hooks

* Mark not executable

* Run pre-commit hooks

* Remove unused variable

* Run pre-commit hooks after rebase

* Update copyrights

* Fix README.md typo (decoupled)

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Run pre-commit hooks

* Grammar fix

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Redundant word

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Revert docker file changes

* Executable shebang revert

* Make model.py files non-executable

* Passin is proper flag

* Run pre-commit hooks on init_args/model.py

* Fix typo in init_args/model.py

* Make copyrights one line

---------

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Fix default instance name change when count is 1 (triton-inference-server#6088)

* Add test for sequence model instance update (triton-inference-server#5831)

* Add test for sequence model instance update

* Add gap for file timestamp update

* Update test for non-blocking sequence update

* Update documentation

* Remove mentioning increase instance count case

* Add more documentaion for scheduler update test

* Update test for non-blocking batcher removal

* Add polling due to async scheduler destruction

* Use _ as private

* Fix typo

* Add docs on instance count decrease

* Fix typo

* Separate direct and oldest to different test cases

* Separate nested tests in a loop into multiple test cases

* Refactor scheduler update test

* Improve doc on handling future test failures

* Address pre-commit

* Add best effort to reset model state after a single test case failure

* Remove reset model method to make harder for chaining multiple test cases as one

* Remove description on model state clean up

* Fix default instance name (triton-inference-server#6097)

* Removing unused tests (triton-inference-server#6085)

* Update post-23.07 release  (triton-inference-server#6103)

* Update README and versions for 2.36.0 / 23.07

* Update Dockerfile.win10.min

* Fix formating issue

* fix formating issue

* Fix whitespaces

* Fix whitespaces

* Fix whitespaces

* Improve asyncio testing (triton-inference-server#6122)

* Reduce instance count to 1 for python bls model loading test (triton-inference-server#6130)

* Reduce instance count to 1 for python bls model loading test

* Add comment when calling unload

* Fix queue test to expect exact number of failures (triton-inference-server#6133)

* Fix queue test to expect exact number of failures

* Increase the execution time to more accurately capture requests

* Add CPU & GPU metrics in Grafana dashboard.json for K8s op prem deployment (fix triton-inference-server#6047) (triton-inference-server#6100)

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Adding the support tracing of child models invoked from a BLS model (triton-inference-server#6063)

* Adding tests for bls

* Added fixme, cleaned previous commit

* Removed unused imports

* Fixing commit tree:
Refactor code, so that OTel tracer provider is initialized only once
Added resource cmd option, testig
Added docs

* Clean up

* Update docs/user_guide/trace.md

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Revision

* Update doc

* Clean up

* Added ostream exporter to OpenTelemetry for testing purposes; refactored trace tests

* Added opentelemetry trace collector set up to tests; refactored otel exporter tests to use OTel collector instead of netcat

* Revising according to comments

* Added comment regarding 'parent_span_id'

* Added permalink

* Adjusted test

---------

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Test python environments 3.8-3.11 (triton-inference-server#6109)

Add tests for python 3.8-3.11 for L0_python_backends

* Improve L0_backend_python debugging (triton-inference-server#6157)

* Improve L0_backend_python debugging

* Use utils function for artifacts collection

* Add unreachable output test for reporting source of disconnectivity (triton-inference-server#6149)

* Update 'main' to track development of 2.38.0 / 23.09 (triton-inference-server#6163)

* Fix the versions in the doc (triton-inference-server#6164)

* Update docs with NVAIE messaging (triton-inference-server#6162)

Update docs with NVAIE messaging

* Add sanity tests for parallel instance loading (triton-inference-server#6126)

* Remove extra whitespace (triton-inference-server#6174)

* Remove a test case that sanity checks input value of --shape CLI flag (triton-inference-server#6140)

* Remove test checking for --shape option

* Remove the entire test

* Add test when unload/load requests for same model are received at the same time (triton-inference-server#6150)

* Add test when unload/load requests for same model are received at the same time

* Add test_same_model_overlapping_load_unload

* Use a load/unload stress test instead

* Pre-merge test name update

* Address pre-commit error

* Revert "Address pre-commit error"

This reverts commit 781cab1.

* Record number of occurrences of each exception

* Make assert failures clearer in L0_trt_plugin (triton-inference-server#6166)

* Add end-to-end CI test for decoupled model support (triton-inference-server#6131) (triton-inference-server#6184)

* Add end-to-end CI test for decoupled model support

* Address feedback

* Test preserve_ordering for oldest strategy sequence batcher (triton-inference-server#6185)

* added debugging guide (triton-inference-server#5924)

* added debugging guide

* Run pre-commit

---------

Co-authored-by: David Yastremsky <dyastremsky@nvidia.com>

* Add deadlock gdb section to debug guide (triton-inference-server#6193)

* Fix character escape in model repository documentation (triton-inference-server#6197)

* Fix docs test (triton-inference-server#6192)

* Add utility functions for array manipulation (triton-inference-server#6203)

* Add utility functions for outlier removal

* Fix functions

* Add newline to end of file

* Add gc collect to make sure gpu tensor is deallocated (triton-inference-server#6205)

* Testing: add gc collect to make sure gpu tensor is deallocated

* Address comment

* Check for log error on failing to find explicit load model (triton-inference-server#6204)

* Set default shm size to 1MB for Python backend (triton-inference-server#6209)

* Trace Model Name Validation (triton-inference-server#6199)

* Initial commit

* Cleanup using new standard formatting

* QA test restructuring

* Add newline to the end of test.sh

* HTTP/gRPC protocol changed to pivot on ready status & error status. Log file name changed in qa test.

* Fixing unhandled error memory leak

* Handle index function memory leak fix

* Fix the check for error message (triton-inference-server#6226)

* Fix copyright for debugging guide (triton-inference-server#6225)

* Add watts units to GPU power metric descriptions (triton-inference-server#6242)

* Update post-23.08 release  (triton-inference-server#6234)

* CUDA 12.1 > 12.2

* DLIS-5208: onnxruntime+windows - stop treat warnings on compile as errors

* Revert "DLIS-5208: onnxruntime+windows - stop treat warnings on compile as errors"

This reverts commit 0cecbb7.

* Update Dockerfile.win10.min

* Update Dockerfile.win10.min

* Update README and versions for 23.08 branch

* Update Dockerfile.win10

* Fix the versions in docs

* Add the note about stabilization of the branch

* Update docs with NVAIE messaging (triton-inference-server#6162) (triton-inference-server#6167)

Update docs with NVAIE messaging

Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com>

* Resolve merge conflict

---------

Co-authored-by: tanmayv25 <tanmay2592@gmail.com>
Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com>

* Add tests/docs for queue size (pending request count) metric (triton-inference-server#6233)

* Adding safe string to number conversions (triton-inference-server#6173)

* Added catch for out of range error for trace setting update

* Added wrapper to safe parse options

* Added option names to errors

* Adjustments

* Quick fix

* Fixing option name for Windows

* Removed repetitive code

* Adjust getopt_long for Windows to use longindex

* Moved try catch into ParseOption

* Removed unused input

* Improved names

* Refactoring and clean up

* Fixed Windows

* Refactored getopt_long for Windows

* Refactored trace test, pinned otel's collector version to avoid problems with go requirements

* Test Python execute() to return Triton error code (triton-inference-server#6228)

* Add test for Python execute error code

* Add all supported error codes into test

* Move ErrorCode into TritonError

* Expose ErrorCode internal in TritonError

* Add docs on IPv6 (triton-inference-server#6262)

* Add test for TensorRT version-compatible model support (triton-inference-server#6255)

* Add tensorrt version-compatibility test

* Generate one version-compatible model

* Fix copyright year

* Remove unnecessary variable

* Remove unnecessary line

* Generate TRT version-compatible model

* Add sample inference to TRT version-compatible test

* Clean up utils and model gen for new plan model

* Fix startswith capitalization

* Remove unused imports

* Remove unused imports

* Add log check

* Upgrade protobuf version (triton-inference-server#6268)

* Add testing for retrieving shape and datatype in backend API (triton-inference-server#6231)

Add testing for retrieving output shape and datatype info from backend API

* Update 'main' to track development of 2.39.0 / 23.10 (triton-inference-server#6277)

* Apply UCX workaround (triton-inference-server#6254)

* Add ensemble parameter forwarding test (triton-inference-server#6284)

* Exclude extra TRT version-compatible models from tests (triton-inference-server#6294)

* Exclude compatible models from tests.

* Force model removal, in case it does not exist

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

---------

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Adding installation of docker and docker-buildx (triton-inference-server#6299)

* Adding installation of docker and docker-buildx

* remove whitespace

* Use targetmodel from header as model name in SageMaker (triton-inference-server#6147)

* Use targetmodel from header as model name in SageMaker

* Update naming for model hash

* Add more error messages, return codes, and refactor HTTP server (triton-inference-server#6297)

* Fix typo (triton-inference-server#6318)

* Update the request re-use example (triton-inference-server#6283)

* Update the request re-use example

* Review edit

* Review comment

* Disable developer tools build for In-process API + JavaCPP tests (triton-inference-server#6296)

* Add Python binding build. Add L0_python_api to test Python binding (triton-inference-server#6319)

* Add L0_python_api to test Python binding

* Install Python API in CI image

* Fix QA build

* Increase network timeout for valgrind (triton-inference-server#6324)

* Tests and docs for ability to specify subdirectory to download for LocalizePath (triton-inference-server#6308)

* Added custom localization tests for s3 and azure, added docs

* Refactor HandleInfer into more readable chunks (triton-inference-server#6332)

* Refactor model generation scripts (triton-inference-server#6336)

* Refactor model generation scripts

* Fix codeql

* Fix relative path import

* Fix package structure

* Copy the gen_common file

* Add missing uint8

* Remove duplicate import

* Add testing for scalar I/O in ORT backend (triton-inference-server#6343)

* Add testing for scalar I/O in ORT backend

* Review edit

* ci

* Update post-23.09 release (triton-inference-server#6367)

* Update README and versions for 23.09 branch (triton-inference-server#6280)

* Update `Dockerfile` and `build.py`  (triton-inference-server#6281)

* Update configuration for Windows Dockerfile (triton-inference-server#6256)

* Adding installation of docker and docker-buildx

* Enable '--expt-relaxed-constexpr' flag for custom ops models

* Update Dockerfile version

* Disable unit tests for Jetson

* Update condition (triton-inference-server#6285)

* removing Whitespaces (triton-inference-server#6293)

* removing Whitespaces

* removing whitespaces

* Add security policy (triton-inference-server#6376)

* Adding client-side request cancellation support and testing (triton-inference-server#6383)

* Add L0_request_cancellation (triton-inference-server#6252)

* Add L0_request_cancellation

* Remove unittest test

* Add cancellation to gRPC server error handling

* Fix up

* Use identity model

* Add tests for gRPC client-side cancellation (triton-inference-server#6278)

* Add tests for gRPC client-side cancellation

* Fix CodeQL issues

* Formatting

* Update qa/L0_client_cancellation/client_cancellation_test.py

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Move to L0_request_cancellation

* Address review comments

* Removing request cancellation support from asyncio version

* Format

* Update copyright

* Remove tests

* Handle cancellation notification in gRPC server (triton-inference-server#6298)

* Handle cancellation notification in gRPC server

* Fix the request ptr initialization

* Update src/grpc/infer_handler.h

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Address review comment

* Fix logs

* Fix request complete callback by removing reference to state

* Improve documentation

---------

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

---------

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Fixes on the gRPC frontend to handle AsyncNotifyWhenDone() API (triton-inference-server#6345)

* Fix segmentation fault in gRPC frontend

* Finalize all states upon completion

* Fixes all state cleanups

* Handle completed states when cancellation notification is received

* Add more documentation steps

* Retrieve dormant states to minimize the memory footprint for long streams

* Update src/grpc/grpc_utils.h

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Use a boolean state instead of raw pointer

---------

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Add L0_grpc_state_cleanup test (triton-inference-server#6353)

* Add L0_grpc_state_cleanup test

* Add model file in QA container

* Fix spelling

* Add remaining subtests

* Add failing subtests

* Format fixes

* Fix model repo

* Fix QA docker file

* Remove checks for the error message when shutting down server

* Fix spelling

* Address review comments

* Add schedulers request cancellation tests (triton-inference-server#6309)

* Add schedulers request cancellation tests

* Merge gRPC client test

* Reduce testing time and cover cancelling other requests as a consequence of request cancellation

* Add streaming request cancellation test

---------

Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>

* Add missing copyright (triton-inference-server#6388)

* Add basic generate endpoints for LLM tasks (triton-inference-server#6366)

* PoC of parsing request prompt and converting to Triton infer request

* Remove extra trace

* Add generate endpoint

* Enable streaming version

* Fix bug

* Fix up

* Add basic testing. Cherry pick from triton-inference-server#6369

* format

* Address comment. Fix build

* Minor cleanup

* cleanup syntax

* Wrap error in SSE format

* Fix up

* Restrict number of response on non-streaming generate

* Address comment on implementation.

* Re-enable trace on generate endpoint

* Add more comprehensive llm endpoint tests (triton-inference-server#6377)

* Add security policy (triton-inference-server#6376)

* Start adding some more comprehensive tests

* Fix test case

* Add response error testing

* Complete test placeholder

* Address comment

* Address comments

* Fix code check

---------

Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com>
Co-authored-by: GuanLuo <gluo@nvidia.com>

* Address comment

* Address comment

* Address comment

* Fix typo

---------

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com>

* Add Python backend request cancellation test (triton-inference-server#6364)

* Add cancelled response status test

* Add Python backend request cancellation test

* Add Python backend decoupled request cancellation test

* Simplified response if cancelled

* Test response_sender.send() after closed

* Rollback test response_sender.send() after closed

* Rollback non-decoupled any response on cancel

* Add TRT-LLM backend build to Triton (triton-inference-server#6365) (triton-inference-server#6392)

* Add TRT-LLM backend build to Triton (triton-inference-server#6365)

* Add trtllm backend to build

* Temporarily adding version map for 23.07

* Fix build issue

* Update comment

* Comment out python binding changes

* Add post build

* Update trtllm backend naming

* Update TRTLLM base image

* Fix cmake arch

* Revert temp changes for python binding PR

* Address comment

* Move import to the top (triton-inference-server#6395)

* Move import to the top

* pre commit format

* Add Python backend when vLLM backend built (triton-inference-server#6397)

* Update build.py to build vLLM backend (triton-inference-server#6394)

* Support parameters object in generate route

* Update 'main' to track development of 2.40.0 / 23.11 (triton-inference-server#6400)

* Fix L0_sdk (triton-inference-server#6387)

* Add documentation on request cancellation (triton-inference-server#6403)

* Add documentation on request cancellation

* Include python backend

* Update docs/user_guide/request_cancellation.md

Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>

* Update docs/user_guide/request_cancellation.md

Co-authored-by: Neelay Shah <neelays@nvidia.com>

* Update docs/README.md

Co-authored-by: Neelay Shah <neelays@nvidia.com>

* Update docs/user_guide/request_cancellation.md

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Remove inflight term from the main documentation

* Address review comments

* Fix

* Update docs/user_guide/request_cancellation.md

Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>

* Fix

---------

Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>

* Fixes in request cancellation doc (triton-inference-server#6409)

* Document generate HTTP endpoint (triton-inference-server#6412)

* Document generate HTTP endpoint

* Address comment

* Fix up

* format

* Address comment

* Update SECURITY.md to not display commented copyright (triton-inference-server#6426)

* Fix missing library in L0_data_compression (triton-inference-server#6424)

* Fix missing library in L0_data_compression

* Fix up

* Add Javacpp-presets repo location as env variable in Java tests (triton-inference-server#6385)

Simplify testing when upstream (javacpp-presets) build changes. Related to triton-inference-server/client#409

* TRT-LLM backend build changes (triton-inference-server#6406)

* Update url

* Debugging

* Debugging

* Update url

* Fix build for TRT-LLM backend

* Remove TRTLLM TRT and CUDA versions

* Fix up unused var

* Fix up dir name

* Fix cmake patch

* Remove previous TRT version

* Install required packages for example models

* Remove packages that are only needed for testing

* Add gRPC AsyncIO request cancellation tests (triton-inference-server#6408)

* Fix gRPC test failure and refactor

* Add gRPC AsyncIO cancellation tests

* Better check if a request is cancelled

* Use f-string

* Fix L0_implicit_state (triton-inference-server#6427)

* Fixing vllm build (triton-inference-server#6433)

* Fixing torch version for vllm

* Switch Jetson model TensorRT models generation to container (triton-inference-server#6378)

* Switch Jetson model TensorRT models generation to container

* Adding missed file

* Fix typo

* Fix typos

* Remove extra spaces

* Fix typo

* Bumped vllm version (triton-inference-server#6444)

* Adjust test_concurrent_same_model_load_unload_stress (triton-inference-server#6436)

* Adding emergency vllm latest release (triton-inference-server#6454)

* Fix notify state destruction and inflight states tracking (triton-inference-server#6451)

* Ensure notify_state_ gets properly destructed

* Fix inflight state tracking to properly erase states

* Prevent the notify_state from being erased

* Wrap notify_state_ object within unique_ptr

* Update TRT-LLM backend url (triton-inference-server#6455)

* TRTLLM backend post release

* TRTLLM backend post release

* Update submodule url for permission issue

* Update submodule url

* Fix up

* Not using postbuild function to work around submodule url permission issue

* Added docs on python based backends (triton-inference-server#6429)


Co-authored-by: Neelay Shah <neelays@nvidia.com>

* L0_model_config Fix (triton-inference-server#6472)

* Minor fix for L0_model_config

* Add test for Python model parameters (triton-inference-server#6452)

* Test Python BLS with different sizes of CUDA memory pool (triton-inference-server#6276)

* Test with different sizes of CUDA memory pool

* Check the server log for error message

* Improve debugging

* Fix syntax

* Add documentation for K8s-onprem StartupProbe (triton-inference-server#5257)

Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com>
Co-authored-by: Ryan McCormick <mccormick.codes@gmail.com>

* Update `main` post-23.10 release   (triton-inference-server#6484)

* Update README and versions for 23.10 branch (triton-inference-server#6399)

* Cherry-picking vLLM backend changes (triton-inference-server#6404)

* Update build.py to build vLLM backend (triton-inference-server#6394)

* Add Python backend when vLLM backend built (triton-inference-server#6397)

---------

Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com>

* Add documentation on request cancellation (triton-inference-server#6403) (triton-inference-server#6407)

* Add documentation on request cancellation

* Include python backend

* Update docs/user_guide/request_cancellation.md

* Update docs/user_guide/request_cancellation.md

* Update docs/README.md

* Update docs/user_guide/request_cancellation.md

* Remove inflight term from the main documentation

* Address review comments

* Fix

* Update docs/user_guide/request_cancellation.md

* Fix

---------

Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>

* Fixes in request cancellation doc (triton-inference-server#6409) (triton-inference-server#6410)

* TRT-LLM backend build changes (triton-inference-server#6406) (triton-inference-server#6430)

* Update url

* Debugging

* Debugging

* Update url

* Fix build for TRT-LLM backend

* Remove TRTLLM TRT and CUDA versions

* Fix up unused var

* Fix up dir name

* Fix cmake patch

* Remove previous TRT version

* Install required packages for example models

* Remove packages that are only needed for testing

* Fixing vllm build (triton-inference-server#6433) (triton-inference-server#6437)

* Fixing torch version for vllm

Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>

* Update TRT-LLM backend url (triton-inference-server#6455) (triton-inference-server#6460)

* TRTLLM backend post release

* TRTLLM backend post release

* Update submodule url for permission issue

* Update submodule url

* Fix up

* Not using postbuild function to work around submodule url permission issue

* remove redundant lines

* Revert "remove redundant lines"

This reverts commit 86be7ad.

* restore missed lines

* Update build.py

Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>

* Update build.py

Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>

---------

Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: Kris Hung <krish@nvidia.com>
Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com>
Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>

* Adding structure reference to the new document (triton-inference-server#6493)

* Improve L0_backend_python test stability (ensemble / gpu_tensor_lifecycle) (triton-inference-server#6490)

* Test torch allocator gpu memory usage directly rather than global gpu memory for more consistency

* Add L0_generative_sequence test (triton-inference-server#6475)

* Add testing backend and test

* Add test to build / CI. Minor fix on L0_http

* Format. Update backend documentation

* Fix up

* Address comment

* Add negative testing

* Fix up

* Downgrade vcpkg version (triton-inference-server#6503)

* Collecting sub dir artifacts in GitLab yaml. Removing collect function from test script. (triton-inference-server#6499)

* Use post build function for TRT-LLM backend (triton-inference-server#6476)

* Use postbuild function

* Remove updating submodule url

* Enhanced python_backend autocomplete (triton-inference-server#6504)

* Added testing for python_backend autocomplete: optional input and model_transaction_policy

* Parse reuse-grpc-port and reuse-http-port as booleans (triton-inference-server#6511)

Co-authored-by: Francesco Petrini <francescogpetrini@gmail.com>

* Fixing L0_io (triton-inference-server#6510)

* Fixing L0_io

* Add Python-based backends CI (triton-inference-server#6466)

* Bumped vllm version

* Add python-based backends testing

* Add python-based backends CI

* Fix errors

* Add vllm backend

* Fix pre-commit

* Modify test.sh

* Remove vllm_opt qa model

* Remove vLLM backend tests

* Resolve review comments

* Fix pre-commit errors

* Update qa/L0_backend_python/python_based_backends/python_based_backends_test.py

Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>

* Remove collect_artifacts_from_subdir function call

---------

Co-authored-by: oandreeva-nv <oandreeva@nvidia.com>
Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>

* Enabling option to restrict access to HTTP APIs based on header value pairs (similar to gRPC)

* Upgrade DCGM from 2.4.7 to 3.2.6 (triton-inference-server#6515)

* Enhance GCS credentials documentation (triton-inference-server#6526)

* Test file override outside of model directory (triton-inference-server#6516)

* Add boost-filesystem

* Update ORT version to 1.16.2 (triton-inference-server#6531)

* Adjusting expected error msg (triton-inference-server#6517)

* Update 'main' to track development of 2.41.0 / 23.12 (triton-inference-server#6543)

* Enhance testing for pending request count (triton-inference-server#6532)

* Enhance testing for pending request count

* Improve the documentation

* Add more documentation

* Add testing for Python backend request rescheduling (triton-inference-server#6509)

* Add testing

* Fix up

* Enhance testing

* Fix up

* Revert test changes

* Add grpc endpoint test

* Remove unused import

* Remove unused import

* Update qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py

Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>

* Update qa/python_models/bls_request_rescheduling/model.py

Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>

---------

Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>

* Check that the wget is installed (triton-inference-server#6556)

* secure deployment considerations guide (triton-inference-server#6533)

* draft document

* updates

* updates

* updated

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* update

* updates

* updates

* Update docs/customization_guide/deploy.md

Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com>

* Update docs/customization_guide/deploy.md

Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com>

* fixing typos

* updated with clearer warnings

* updates to readme and toc

---------

Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com>

* Fix typo and change the command line order (triton-inference-server#6557)

* Fix typo and change the command line order

* Improve visual experience. Add 'clang' package

* Add error during rescheduling test to L0_generative_sequence (triton-inference-server#6550)

* changing references to concrete instances

* Add testing for implicit state enhancements (triton-inference-server#6524)

* Add testing for single buffer

* Add testing for implicit state with buffer growth

* Improve testing

* Fix up

* Add CUDA virtual address size flag

* Add missing test files

* Parameter rename

* Test fixes

* Only build implicit state backend for GPU=ON

* Fix copyright (triton-inference-server#6584)

* Mention TRT LLM backend supports request cancellation (triton-inference-server#6585)

* update model repository generation for onnx models for protobuf (triton-inference-server#6575)

* Fix L0_sagemaker (triton-inference-server#6587)

* Add C++ server wrapper to the doc (triton-inference-server#6592)

* Add timeout to client apis and tests (triton-inference-server#6546)

Client PR: triton-inference-server/client#429

* Change name generative -> iterative (triton-inference-server#6601)

* name changes

* updated names

* Add documentation on generative sequence (triton-inference-server#6595)

* Add documentation on generative sequence

* Address comment

* Reflect the "iterative" change

* Updated description of iterative sequences

* Restricted HTTP API documentation 

Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>

* Add request cancellation and debugging guide to generated docs (triton-inference-server#6617)

* Support for http request cancellation. Includes fix for seg fault in generate_stream endpoint.

* Bumped vLLM version to v0.2.2 (triton-inference-server#6623)

* Upgrade ORT version (triton-inference-server#6618)

* Use compliant preprocessor (triton-inference-server#6626)

* Update README.md (triton-inference-server#6627)

* Extend request objects' lifetime and fix possible segmentation fault (triton-inference-server#6620)

* Extend request objects lifetime

* Remove explicit TRITONSERVER_InferenceRequestDelete

* Format fix

* Include the inference_request_ initialization to cover RequestNew

---------

Co-authored-by: Neelay Shah <neelays@nvidia.com>

* Update protobuf after python update for testing (triton-inference-server#6638)

This fixes the issue where the Python client raises
`AttributeError: 'NoneType' object has no attribute 'enum_types_by_name'`
errors after the Python version is updated.

* Update post-23.11 release (triton-inference-server#6653)

* Update README and versions for 2.40.0 / 23.11 (triton-inference-server#6544)

* Removing path construction to use SymLink alternatives

* Update version for PyTorch

* Update windows Dockerfile configuration

* Update triton version to 23.11

* Update README and versions for 2.40.0 / 23.11

* Fix typo

* Adding 'ldconfig' to configure dynamic linking in container (triton-inference-server#6602)

* Point to tekit_backend (triton-inference-server#6616)

* Point to tekit_backend

* Update version

* Revert tekit changes (triton-inference-server#6640)

---------

Co-authored-by: Kris Hung <krish@nvidia.com>

* PYBE Timeout Tests (triton-inference-server#6483)

* New testing to confirm large request timeout values can be passed and retrieved within Python BLS models.

* Add note on lack of ensemble support (triton-inference-server#6648)

* Added request id to span attributes (triton-inference-server#6667)

* Add test for optional internal tensor within an ensemble (triton-inference-server#6663)

* Add test for optional internal tensor within an ensemble

* Fix up

* Set CMake version to 3.27.7 (triton-inference-server#6675)

* Set CMake version to 3.27.7

* Set CMake version to 3.27.7

* Fix double slash typo

* restore typo (triton-inference-server#6680)

* Update 'main' to track development of 2.42.0 / 24.01 (triton-inference-server#6673)

* iGPU build refactor (triton-inference-server#6684) (triton-inference-server#6691)

* Mlflow Plugin Fix (triton-inference-server#6685)

* Mlflow plugin fix

* Fix extra content-type headers in HTTP server (triton-inference-server#6678)

* Fix iGPU CMakeFile tags (triton-inference-server#6695)

* Unify iGPU test build with x86 ARM

* adding TRITON_IGPU_BUILD to core build definition; adding logic to skip caffe2plan test if TRITON_IGPU_BUILD=1

* re-organizing some copies in Dockerfile.QA to fix igpu devel build

* Pre-commit fix

---------

Co-authored-by: kyle <kmcgill@kmcgill-ubuntu.nvidia.com>

* adding default value for TRITON_IGPU_BUILD=OFF (triton-inference-server#6705)

* adding default value for TRITON_IGPU_BUILD=OFF

* fix newline

---------

Co-authored-by: kyle <kmcgill@kmcgill-ubuntu.nvidia.com>

* Add test case for decoupled model raising exception (triton-inference-server#6686)

* Add test case for decoupled model raising exception

* Remove unused import

* Address comment

* Escape special characters in general docs (triton-inference-server#6697)

* vLLM Benchmarking Test (triton-inference-server#6631)

* vLLM Benchmarking Test

* Allow configuring GRPC max connection age and max connection age grace (triton-inference-server#6639)

* Add ability to configure GRPC max connection age and max connection age grace
* Allow passing GRPC connection age args when they are set from command
----------
Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com>

---------

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
Co-authored-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>
Co-authored-by: Neelay Shah <neelays@nvidia.com>
Co-authored-by: Tanmay Verma <tanmay2592@gmail.com>
Co-authored-by: Kris Hung <krish@nvidia.com>
Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com>
Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com>
Co-authored-by: Gerard Casas Saez <gerardc@squareup.com>
Co-authored-by: Misha Chornyi <99709299+mc-nv@users.noreply.github.com>
Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com>
Co-authored-by: Elias Bermudez <6505145+debermudez@users.noreply.github.com>
Co-authored-by: ax-vivien <113907557+ax-vivien@users.noreply.github.com>
Co-authored-by: Neelay Shah <neelays@neelays-dt.nvidia.com>
Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com>
Co-authored-by: Matthew Kotila <matthew.r.kotila@gmail.com>
Co-authored-by: Nikhil Kulkarni <knikhil29@gmail.com>
Co-authored-by: Misha Chornyi <mchornyi@nvidia.com>
Co-authored-by: Iman Tabrizian <itabrizian@nvidia.com>
Co-authored-by: David Yastremsky <dyastremsky@nvidia.com>
Co-authored-by: Timothy Gerdes <50968584+tgerdesnv@users.noreply.github.com>
Co-authored-by: Mate Mijolović <mate.mijolovic@gmail.com>
Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com>
Co-authored-by: Hyunjae Woo <107147848+nv-hwoo@users.noreply.github.com>
Co-authored-by: Tanay Varshney <tvarshney@nvidia.com>
Co-authored-by: Francesco Petrini <francescogpetrini@gmail.com>
Co-authored-by: Dmitry Mironov <dmitrym@nvidia.com>
Co-authored-by: Ryan McCormick <mccormick.codes@gmail.com>
Co-authored-by: Sai Kiran Polisetty <spolisetty@nvidia.com>
Co-authored-by: oandreeva-nv <oandreeva@nvidia.com>
Co-authored-by: kyle <kmcgill@kmcgill-ubuntu.nvidia.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
Co-authored-by: siweili11 <152239970+siweili11@users.noreply.github.com>