Refactor HTTP server #6297
Conversation
…sage in body more consistently for most routes
    return; \
  } while (false)

namespace {
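For context, the `do { ... } while (false)` wrapper visible in the diff is the standard way to make a multi-statement macro expand to a single statement, so it composes safely with unbraced `if`/`else`. A minimal, self-contained sketch of the pattern follows; the macro and helper names here (`RESPOND_AND_RETURN`, `Handle`, the `sent` log) are stand-ins for illustration, not the real evhtp calls used in this PR:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in response log so the example is self-contained;
// the real macro would call evhtp_send_reply() instead.
static std::vector<std::string> sent;

// The do/while(false) wrapper makes the macro body a single statement,
// so the call site can be used inside an unbraced if without a
// dangling-else problem, and the trailing semicolon parses naturally.
#define RESPOND_AND_RETURN(code, msg)                      \
  do {                                                     \
    sent.push_back(std::to_string(code) + " " + (msg));    \
    return;                                                \
  } while (false)

void
Handle(bool is_get)
{
  if (!is_get)
    RESPOND_AND_RETURN(405, "Method Not Allowed");  // expands safely here
  sent.push_back("200 OK");
}
```

Without the wrapper, the two statements in the macro body would escape the `if` above, and the early `return` would run unconditionally.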
Note for reviewers
In a quick follow-up PR, I can move this EVBufferJson block in with the rest of the helpers, and move the HTTPServer Start/Stop and HTTPMetricsServer code down after the helpers with the rest of the server code.
Doing so in this PR would add a lot of diff noise relative to the actual changes.
src/http_server.cc
Outdated
@@ -142,8 +180,7 @@ HTTPMetricsServer::Handle(evhtp_request_t* req)
       << req->uri->path->full;

-  if (req->method != htp_method_GET) {
-    evhtp_send_reply(req, EVHTP_RES_METHNALLOWED);
-    return;
+  HTTP_RESPOND_WITH_CODE(req, EVHTP_RES_METHNALLOWED, "Method Not Allowed");
Note for reviewers
Eventually I think we can provide some mapping between code -> message, but it's done inline for simplicity. We don't always provide the same error message for the 400 (EVHTP_RES_BADREQ) code.
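The eventual code -> message mapping mentioned above could be sketched as a default reason-phrase lookup. This is a hypothetical helper (`DefaultStatusMessage` is not part of this PR); call sites that need a more specific message, such as the varying 400 texts noted above, would still pass their own string instead of the default:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical default reason-phrase lookup keyed by HTTP status code.
// A call site like HTTP_RESPOND_WITH_CODE could fall back to this when
// no custom message is supplied.
std::string
DefaultStatusMessage(int code)
{
  static const std::unordered_map<int, std::string> kMessages = {
      {400, "Bad Request"},
      {404, "Not Found"},
      {405, "Method Not Allowed"},
      {500, "Internal Server Error"},
  };
  auto it = kMessages.find(code);
  return it != kMessages.end() ? it->second : "Unknown Error";
}
```

This keeps the common cases in one table while leaving room for per-route overrides where the same code carries different messages.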
…OND -> RETURN_AND_RESPOND for clarity
Nice work!
cancellation (triton-inference-server#6403) * Add documentation on request cancellation * Include python backend * Update docs/user_guide/request_cancellation.md Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com> * Update docs/user_guide/request_cancellation.md Co-authored-by: Neelay Shah <neelays@nvidia.com> * Update docs/README.md Co-authored-by: Neelay Shah <neelays@nvidia.com> * Update docs/user_guide/request_cancellation.md Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> * Remove inflight term from the main documentation * Address review comments * Fix * Update docs/user_guide/request_cancellation.md Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Fix --------- Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com> Co-authored-by: Neelay Shah <neelays@nvidia.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Fixes in request cancellation doc (triton-inference-server#6409) * Document generate HTTP endpoint (triton-inference-server#6412) * Document generate HTTP endpoint * Address comment * Fix up * format * Address comment * Update SECURITY.md to not display commented copyright (triton-inference-server#6426) * Fix missing library in L0_data_compression (triton-inference-server#6424) * Fix missing library in L0_data_compression * Fix up * Add Javacpp-presets repo location as env variable in Java tests(triton-inference-server#6385) Simplify testing when upstream (javacpp-presets) build changes. 
Related to triton-inference-server/client#409 * TRT-LLM backend build changes (triton-inference-server#6406) * Update url * Debugging * Debugging * Update url * Fix build for TRT-LLM backend * Remove TRTLLM TRT and CUDA versions * Fix up unused var * Fix up dir name * FIx cmake patch * Remove previous TRT version * Install required packages for example models * Remove packages that are only needed for testing * Add gRPC AsyncIO request cancellation tests (triton-inference-server#6408) * Fix gRPC test failure and refactor * Add gRPC AsyncIO cancellation tests * Better check if a request is cancelled * Use f-string * Fix L0_implicit_state (triton-inference-server#6427) * Fixing vllm build (triton-inference-server#6433) * Fixing torch version for vllm * Switch Jetson model TensorRT models generation to container (triton-inference-server#6378) * Switch Jetson model TensorRT models generation to container * Adding missed file * Fix typo * Fix typos * Remove extra spaces * Fix typo * Bumped vllm version (triton-inference-server#6444) * Adjust test_concurrent_same_model_load_unload_stress (triton-inference-server#6436) * Adding emergency vllm latest release (triton-inference-server#6454) * Fix notify state destruction and inflight states tracking (triton-inference-server#6451) * Ensure notify_state_ gets properly destructed * Fix inflight state tracking to properly erase states * Prevent removing the notify_state from being erased * Wrap notify_state_ object within unique_ptr * Update TRT-LLM backend url (triton-inference-server#6455) * TRTLLM backend post release * TRTLLM backend post release * Update submodule url for permission issue * Update submodule url * Fix up * Not using postbuild function to workaround submodule url permission issue * Added docs on python based backends (triton-inference-server#6429) Co-authored-by: Neelay Shah <neelays@nvidia.com> * L0_model_config Fix (triton-inference-server#6472) * Minor fix for L0_model_config * Add test for Python model 
parameters (triton-inference-server#6452) * Test Python BLS with different sizes of CUDA memory pool (triton-inference-server#6276) * Test with different sizes of CUDA memory pool * Check the server log for error message * Improve debugging * Fix syntax * Add documentation for K8s-onprem StartupProbe (triton-inference-server#5257) Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: Ryan McCormick <mccormick.codes@gmail.com> * Update `main` post-23.10 release (triton-inference-server#6484) * Update README and versions for 23.10 branch (triton-inference-server#6399) * Cherry-picking vLLM backend changes (triton-inference-server#6404) * Update build.py to build vLLM backend (triton-inference-server#6394) * Add Python backend when vLLM backend built (triton-inference-server#6397) --------- Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> * Add documentation on request cancellation (triton-inference-server#6403) (triton-inference-server#6407) * Add documentation on request cancellation * Include python backend * Update docs/user_guide/request_cancellation.md * Update docs/user_guide/request_cancellation.md * Update docs/README.md * Update docs/user_guide/request_cancellation.md * Remove inflight term from the main documentation * Address review comments * Fix * Update docs/user_guide/request_cancellation.md * Fix --------- Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com> Co-authored-by: Neelay Shah <neelays@nvidia.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> * Fixes in request cancellation doc (triton-inference-server#6409) (triton-inference-server#6410) * TRT-LLM backend build changes (triton-inference-server#6406) (triton-inference-server#6430) * Update url * Debugging * Debugging * Update url * Fix build for TRT-LLM backend * Remove TRTLLM TRT and CUDA versions * Fix up unused var * Fix up dir name * FIx cmake 
patch * Remove previous TRT version * Install required packages for example models * Remove packages that are only needed for testing * Fixing vllm build (triton-inference-server#6433) (triton-inference-server#6437) * Fixing torch version for vllm Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> * Update TRT-LLM backend url (triton-inference-server#6455) (triton-inference-server#6460) * TRTLLM backend post release * TRTLLM backend post release * Update submodule url for permission issue * Update submodule url * Fix up * Not using postbuild function to workaround submodule url permission issue * remove redundant lines * Revert "remove redundant lines" This reverts commit 86be7ad. * restore missed lines * Update build.py Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> * Update build.py Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> --------- Co-authored-by: Tanmay Verma <tanmay2592@gmail.com> Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com> Co-authored-by: Neelay Shah <neelays@nvidia.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> Co-authored-by: Kris Hung <krish@nvidia.com> Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com> Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> * Adding structure reference to the new document (triton-inference-server#6493) * Improve L0_backend_python test stability (ensemble / gpu_tensor_lifecycle) (triton-inference-server#6490) * Test torch allocator gpu memory usage directly rather than global gpu memory for more consistency * Add L0_generative_sequence test (triton-inference-server#6475) * Add testing backend and test * Add test to build / CI. Minor fix on L0_http * Format. 
Update backend documentation * Fix up * Address comment * Add negative testing * Fix up * Downgrade vcpkg version (triton-inference-server#6503) * Collecting sub dir artifacts in GitLab yaml. Removing collect function from test script. (triton-inference-server#6499) * Use post build function for TRT-LLM backend (triton-inference-server#6476) * Use postbuild function * Remove updating submodule url * Enhanced python_backend autocomplete (triton-inference-server#6504) * Added testing for python_backend autocomplete: optional input and model_transaction_policy * Parse reuse-grpc-port and reuse-http-port as booleans (triton-inference-server#6511) Co-authored-by: Francesco Petrini <francescogpetrini@gmail.com> * Fixing L0_io (triton-inference-server#6510) * Fixing L0_io * Add Python-based backends CI (triton-inference-server#6466) * Bumped vllm version * Add python-bsed backends testing * Add python-based backends CI * Fix errors * Add vllm backend * Fix pre-commit * Modify test.sh * Remove vllm_opt qa model * Remove vLLM ackend tests * Resolve review comments * Fix pre-commit errors * Update qa/L0_backend_python/python_based_backends/python_based_backends_test.py Co-authored-by: Tanmay Verma <tanmay2592@gmail.com> * Remove collect_artifacts_from_subdir function call --------- Co-authored-by: oandreeva-nv <oandreeva@nvidia.com> Co-authored-by: Tanmay Verma <tanmay2592@gmail.com> * Enabling option to restrict access to HTTP APIs based on header value pairs (similar to gRPC) * Upgrade DCGM from 2.4.7 to 3.2.6 (triton-inference-server#6515) * Enhance GCS credentials documentations (triton-inference-server#6526) * Test file override outside of model directory (triton-inference-server#6516) * Add boost-filesystem * Update ORT version to 1.16.2 (triton-inference-server#6531) * Adjusting expected error msg (triton-inference-server#6517) * Update 'main' to track development of 2.41.0 / 23.12 (triton-inference-server#6543) * Enhance testing for pending request count 
(triton-inference-server#6532) * Enhance testing for pending request count * Improve the documentation * Add more documentation * Add testing for Python backend request rescheduling (triton-inference-server#6509) * Add testing * Fix up * Enhance testing * Fix up * Revert test changes * Add grpc endpoint test * Remove unused import * Remove unused import * Update qa/L0_backend_python/request_rescheduling/grpc_endpoint_test.py Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com> * Update qa/python_models/bls_request_rescheduling/model.py Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com> --------- Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com> * Check that the wget is installed (triton-inference-server#6556) * secure deployment considerations guide (triton-inference-server#6533) * draft document * updates * updates * updated * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * updates * update * updates * updates * Update docs/customization_guide/deploy.md Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com> * Update docs/customization_guide/deploy.md Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com> * fixing typos * updated with clearer warnings * updates to readme and toc --------- Co-authored-by: Kyle McGill <101670481+nv-kmcgill53@users.noreply.github.com> * Fix typo and change the command line order (triton-inference-server#6557) * Fix typo and change the command line order * Improve visual experience. 
Add 'clang' package * Add error during rescheduling test to L0_generative_sequence (triton-inference-server#6550) * changing references to concrete instances * Add testing for implicit state enhancements (triton-inference-server#6524) * Add testing for single buffer * Add testing for implicit state with buffer growth * Improve testing * Fix up * Add CUDA virtual address size flag * Add missing test files * Parameter rename * Test fixes * Only build implicit state backend for GPU=ON * Fix copyright (triton-inference-server#6584) * Mention TRT LLM backend supports request cancellation (triton-inference-server#6585) * update model repository generation for onnx models for protobuf (triton-inference-server#6575) * Fix L0_sagemaker (triton-inference-server#6587) * Add C++ server wrapper to the doc (triton-inference-server#6592) * Add timeout to client apis and tests (triton-inference-server#6546) Client PR: triton-inference-server/client#429 * Change name generative -> iterative (triton-inference-server#6601) * name changes * updated names * Add documentation on generative sequence (triton-inference-server#6595) * Add documentation on generative sequence * Address comment * Reflect the "iterative" change * Updated description of iterative sequences * Restricted HTTP API documentation Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> * Add request cancellation and debugging guide to generated docs (triton-inference-server#6617) * Support for http request cancellation. Includes fix for seg fault in generate_stream endpoint. 
* Bumped vLLM version to v0.2.2 (triton-inference-server#6623) * Upgrade ORT version (triton-inference-server#6618) * Use compliant preprocessor (triton-inference-server#6626) * Update README.md (triton-inference-server#6627) * Extend request objects lifetime and fixes possible segmentation fault (triton-inference-server#6620) * Extend request objects lifetime * Remove explicit TRITONSERVER_InferenceRequestDelete * Format fix * Include the inference_request_ initialization to cover RequestNew --------- Co-authored-by: Neelay Shah <neelays@nvidia.com> * Update protobuf after python update for testing (triton-inference-server#6638) This fixes the issue where python client has `AttributeError: 'NoneType' object has no attribute 'enum_types_by_name' errors after python version is updated. * Update post-23.11 release (triton-inference-server#6653) * Update README and versions for 2.40.0 / 23.11 (triton-inference-server#6544) * Removing path construction to use SymLink alternatives * Update version for PyTorch * Update windows Dockerfile configuration * Update triton version to 23.11 * Update README and versions for 2.40.0 / 23.11 * Fix typo * Ading 'ldconfig' to configure dynamic linking in container (triton-inference-server#6602) * Point to tekit_backend (triton-inference-server#6616) * Point to tekit_backend * Update version * Revert tekit changes (triton-inference-server#6640) --------- Co-authored-by: Kris Hung <krish@nvidia.com> * PYBE Timeout Tests (triton-inference-server#6483) * New testing to confirm large request timeout values can be passed and retrieved within Python BLS models. 
* Add note on lack of ensemble support (triton-inference-server#6648) * Added request id to span attributes (triton-inference-server#6667) * Add test for optional internal tensor within an ensemble (triton-inference-server#6663) * Add test for optional internal tensor within an ensemble * Fix up * Set CMake version to 3.27.7 (triton-inference-server#6675) * Set CMake version to 3.27.7 * Set CMake version to 3.27.7 * Fix double slash typo * restore typo (triton-inference-server#6680) * Update 'main' to track development of 2.42.0 / 24.01 (triton-inference-server#6673) * iGPU build refactor (triton-inference-server#6684) (triton-inference-server#6691) * Mlflow Plugin Fix (triton-inference-server#6685) * Mlflow plugin fix * Fix extra content-type headers in HTTP server (triton-inference-server#6678) * Fix iGPU CMakeFile tags (triton-inference-server#6695) * Unify iGPU test build with x86 ARM * adding TRITON_IGPU_BUILD to core build definition; adding logic to skip caffe2plan test if TRITON_IGPU_BUILD=1 * re-organizing some copies in Dockerfile.QA to fix igpu devel build * Pre-commit fix --------- Co-authored-by: kyle <kmcgill@kmcgill-ubuntu.nvidia.com> * adding default value for TRITON_IGPU_BUILD=OFF (triton-inference-server#6705) * adding default value for TRITON_IGPU_BUILD=OFF * fix newline --------- Co-authored-by: kyle <kmcgill@kmcgill-ubuntu.nvidia.com> * Add test case for decoupled model raising exception (triton-inference-server#6686) * Add test case for decoupled model raising exception * Remove unused import * Address comment * Escape special characters in general docs (triton-inference-server#6697) * vLLM Benchmarking Test (triton-inference-server#6631) * vLLM Benchmarking Test * Allow configuring GRPC max connection age and max connection age grace (triton-inference-server#6639) * Add ability to configure GRPC max connection age and max connection age grace * Allow pass GRPC connection age args when they are set from command ---------- Co-authored-by: 
Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com> --------- Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com> Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com> Co-authored-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com> Co-authored-by: Neelay Shah <neelays@nvidia.com> Co-authored-by: Tanmay Verma <tanmay2592@gmail.com> Co-authored-by: Kris Hung <krish@nvidia.com> Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> Co-authored-by: Ryan McCormick <rmccormick@nvidia.com> Co-authored-by: dyastremsky <58150256+dyastremsky@users.noreply.github.com> Co-authored-by: Katherine Yang <80359429+jbkyang-nvi@users.noreply.github.com> Co-authored-by: Iman Tabrizian <iman.tabrizian@gmail.com> Co-authored-by: Gerard Casas Saez <gerardc@squareup.com> Co-authored-by: Misha Chornyi <99709299+mc-nv@users.noreply.github.com> Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com> Co-authored-by: Elias Bermudez <6505145+debermudez@users.noreply.github.com> Co-authored-by: ax-vivien <113907557+ax-vivien@users.noreply.github.com> Co-authored-by: Neelay Shah <neelays@neelays-dt.nvidia.com> Co-authored-by: nv-kmcgill53 <101670481+nv-kmcgill53@users.noreply.github.com> Co-authored-by: Matthew Kotila <matthew.r.kotila@gmail.com> Co-authored-by: Nikhil Kulkarni <knikhil29@gmail.com> Co-authored-by: Misha Chornyi <mchornyi@nvidia.com> Co-authored-by: Iman Tabrizian <itabrizian@nvidia.com> Co-authored-by: David Yastremsky <dyastremsky@nvidia.com> Co-authored-by: Timothy Gerdes <50968584+tgerdesnv@users.noreply.github.com> Co-authored-by: Mate Mijolović <mate.mijolovic@gmail.com> Co-authored-by: David Zier <42390249+dzier@users.noreply.github.com> Co-authored-by: Hyunjae Woo <107147848+nv-hwoo@users.noreply.github.com> Co-authored-by: Tanay Varshney <tvarshney@nvidia.com> Co-authored-by: Francesco Petrini <francescogpetrini@gmail.com> Co-authored-by: Dmitry Mironov <dmitrym@nvidia.com> Co-authored-by: Ryan McCormick <mccormick.codes@gmail.com> 
Co-authored-by: Sai Kiran Polisetty <spolisetty@nvidia.com> Co-authored-by: oandreeva-nv <oandreeva@nvidia.com> Co-authored-by: kyle <kmcgill@kmcgill-ubuntu.nvidia.com> Co-authored-by: Neal Vaidya <nealv@nvidia.com> Co-authored-by: siweili11 <152239970+siweili11@users.noreply.github.com>
- `ParseJsonTritonIO` --> parses the standard `{inputs: {name, datatype, shape}, ...}` format, leaving room for accepting JSON inputs in other formats in the future, such as `{prompt, parameters}`.
- Split the large `EVBufferToInput` function into more manageable chunks.
- Improved the `curl` UX, similar to the FastAPI and vLLM API server experience.

Error Message UX Details
Before
Query Triton routes:
Responses are blank, so you have to remember to grab the error code:
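The original screenshots didn't survive here. As a rough stand-in for the old behavior (a stub `http.server`, not Triton itself; the port is ephemeral and the model path is made up), this is what the client saw: an empty body, with the status code as the only clue.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class OldStyleHandler(BaseHTTPRequestHandler):
    """Mimics the pre-refactor behavior: a status code with a blank body."""
    def do_GET(self):
        self.send_response(405)                  # Method Not Allowed
        self.send_header("Content-Length", "0")
        self.end_headers()                       # no message in the body
    def log_message(self, fmt, *args):           # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), OldStyleHandler)  # port 0 = ephemeral
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/v2/models/mymodel/infer"

body, code = None, None
try:
    urllib.request.urlopen(url)
except urllib.error.HTTPError as err:
    body = err.read().decode()
    code = err.code
finally:
    server.shutdown()

print(repr(body))  # '' -- the body tells you nothing
print(code)        # 405 -- the only clue, fetched separately
```

With `curl` against a real server this is the difference between a silent `curl` and having to re-run with `curl -w '%{http_code}'` to see the status.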
But then you have to go look up what 405 means, and so on.
After
Query Triton routes:
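A matching stand-in for the post-refactor behavior (again a stub, not Triton's actual handler): the same status code now arrives with a human-readable message in the body, in the spirit of the `HTTP_RESPOND_WITH_CODE(req, EVHTP_RES_METHNALLOWED, "Method Not Allowed")` calls in this diff.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NewStyleHandler(BaseHTTPRequestHandler):
    """Mimics the post-refactor behavior: status code *and* message body."""
    def do_GET(self):
        message = b"Method Not Allowed"
        self.send_response(405)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(message)))
        self.end_headers()
        self.wfile.write(message)                # error text in the body
    def log_message(self, fmt, *args):           # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), NewStyleHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/v2/models/mymodel/infer"

body, code = None, None
try:
    urllib.request.urlopen(url)
except urllib.error.HTTPError as err:
    body = err.read().decode()
    code = err.code
finally:
    server.shutdown()

print(code, body)  # 405 Method Not Allowed -- self-explanatory
```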
Now it's more immediately obvious what's wrong with our queries.
FastAPI/vLLM for comparison
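The comparison screenshots are missing from this copy. For reference, FastAPI-style servers report the same class of error as a small JSON body; the literal below shows the shape of FastAPI's default 405 response (written out by hand here rather than by running FastAPI, so treat the exact wording as approximate).

```python
import json

# Shape of the body a FastAPI app returns alongside a 405 status:
fastapi_body = {"detail": "Method Not Allowed"}
print(json.dumps(fastapi_body))  # -> {"detail": "Method Not Allowed"}
```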