* Add quick start documentation page (triton-inference-server#261)

* Add quick start documentation page

* Addressed comments

* Addressed comments

* Fixed typo

* Addressed comments

* Add installation documentation page (triton-inference-server#266)

* Add installation documentation page

* Add quick start to docs readme

* Add CLI documentation page (triton-inference-server#271)

* Add CLI documentation page

* Addressed comments

* Addressed comments

* Clean up

* Clean up

* Rewrite top-level readme (triton-inference-server#273)

* Rewrite top-level readme

* Addressed comments

* Add input data, measurements, benchmarking documentation (triton-inference-server#274)

* Add data guide documentation

* Add measurement doc

* Add benchmarking doc

* Add more to measurements and metrics doc and new link in readme

* Fix some comments and add more to metrics

* Move data_guide to input_data

* Adjusted header size

* Update all links

* Add back communication protocol docs, fix links, add bls composing models option (triton-inference-server#287)

* Fix various things in docs (triton-inference-server#288)

* Fix various things in docs

* Addressed comments

* Proof-read all docs (triton-inference-server#290)

* Proof-read all docs

* Addressed comments

* Fix docs

---------

Co-authored-by: Matthew Kotila <matthew.r.kotila@gmail.com>
debermudez and matthewkotila authored Apr 20, 2023
1 parent 2147e39 commit ddab817
Showing 10 changed files with 1,837 additions and 699 deletions.
812 changes: 115 additions & 697 deletions src/c++/perf_analyzer/README.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions src/c++/perf_analyzer/command_line_parser.cc
@@ -168,8 +168,8 @@ CLParser::Usage(const std::string& msg)
  "{\"data\" : [{\"TORCHSERVE_INPUT\" : [\"<complete path to the "
  "content file>\"]}, {...}...]}. The type of file here will depend "
  "on the model. In order to use \"triton_c_api\" you must specify "
- "the Triton server install path and the model repository "
- "path via the --library-name and --model-repo flags",
+ "the Triton server install path and the model repository path via "
+ "the --triton-server-directory and --model-repository flags",
  18)
  << std::endl;

54 changes: 54 additions & 0 deletions src/c++/perf_analyzer/docs/README.md
@@ -0,0 +1,54 @@
<!--
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# **Perf Analyzer Documentation**

| [Installation](README.md#installation) | [Getting Started](README.md#getting-started) | [User Guide](README.md#user-guide) |
| -------------------------------------- | -------------------------------------------- | ---------------------------------- |

## **Installation**

See the [Installation Guide](install.md) for details on how to install Perf
Analyzer.

## **Getting Started**

The [Quick Start Guide](quick_start.md) will show you how to use Perf
Analyzer to profile a simple PyTorch model.

## **User Guide**

The User Guide describes the Perf Analyzer command line options, how to specify
model input data, the performance measurement modes, the performance metrics and
outputs, how to benchmark different servers, and more.

- [Perf Analyzer CLI](cli.md)
- [Inference Load Modes](inference_load_modes.md)
- [Input Data](input_data.md)
- [Measurements & Metrics](measurements_metrics.md)
- [Benchmarking](benchmarking.md)
250 changes: 250 additions & 0 deletions src/c++/perf_analyzer/docs/benchmarking.md
@@ -0,0 +1,250 @@
<!--
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Benchmarking Triton via HTTP or gRPC endpoint

This is the default mode for Perf Analyzer.
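
For a quick illustration, a minimal sketch of invoking this default mode
against an already running Triton instance is shown below. The model name
`my_model` and the default Triton ports (`8000` for HTTP, `8001` for gRPC) are
assumptions for the example, not values prescribed by this guide.

```
# Hedged sketch: benchmark a model over HTTP (Triton's default HTTP endpoint)
$ perf_analyzer -m my_model -u localhost:8000

# Hedged sketch: benchmark the same model over gRPC instead
$ perf_analyzer -m my_model -i grpc -u localhost:8001
```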

# Benchmarking Triton directly via C API

Besides using HTTP or gRPC endpoints to communicate with Triton, Perf Analyzer
also allows users to benchmark Triton directly through the C API. HTTP and gRPC
endpoints introduce additional latency in the pipeline, which may not be of
interest to users who use Triton via the C API within their application.
Specifically, this feature is useful for benchmarking a bare-minimum Triton
server without the additional overhead of HTTP/gRPC communication.

## Prerequisite

Pull the Triton SDK and Triton Server container images onto the target machine.
Since you will need access to the `tritonserver` install, it might be easiest
to copy the `perf_analyzer` binary into the Inference Server container.
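
As a rough sketch of that setup, the commands below pull both images and copy
the binary across. The `<yy.mm>` release tag, the container name `triton`, and
the location of `perf_analyzer` inside the SDK image are assumptions, not
values from this guide.

```
# Hedged sketch: pull the SDK and server images (<yy.mm> is a placeholder tag)
$ docker pull nvcr.io/nvidia/tritonserver:<yy.mm>-py3-sdk
$ docker pull nvcr.io/nvidia/tritonserver:<yy.mm>-py3

# Copy perf_analyzer out of the SDK image (binary path is an assumption) and
# into an already running server container named "triton"
$ docker create --name sdk nvcr.io/nvidia/tritonserver:<yy.mm>-py3-sdk
$ docker cp sdk:/usr/local/bin/perf_analyzer ./perf_analyzer
$ docker rm sdk
$ docker cp ./perf_analyzer triton:/usr/local/bin/perf_analyzer
```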

## Required parameters

Use the `--help` option to see a complete list of supported command line
arguments. By default, Perf Analyzer expects the Triton instance to already be
running. You can configure C API mode using the
[`--service-kind`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
option. In addition, you will need to point Perf Analyzer to the Triton server
library path using the
[`--triton-server-directory`](cli.md#--triton-server-directorypath) option and
the model repository path using the
[`--model-repository`](cli.md#--model-repositorypath) option.

An example run would look like:

```
$ perf_analyzer -m my_model --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/my/model/repository
...
*** Measurement Settings ***
Service Kind: Triton C-API
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 353
Throughput: 19.6095 infer/sec
Avg latency: 50951 usec (standard deviation 2265 usec)
p50 latency: 50833 usec
p90 latency: 50923 usec
p95 latency: 50940 usec
p99 latency: 50985 usec
Server:
Inference count: 353
Execution count: 353
Successful request count: 353
Avg request latency: 50841 usec (overhead 20 usec + queue 63 usec + compute input 35 usec + compute infer 50663 usec + compute output 59 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 19.6095 infer/sec, latency 50951 usec
```

## Non-supported functionalities

There are a few functionalities that are missing from C API mode:

1. Async mode ([`--async`](cli.md#--async))
2. For additional known non-working cases, please refer to
[qa/L0_perf_analyzer_capi/test.sh](https://github.com/triton-inference-server/server/blob/main/qa/L0_perf_analyzer_capi/test.sh#L239-L277)

# Benchmarking TensorFlow Serving

Perf Analyzer can also be used to benchmark models deployed on
[TensorFlow Serving](https://github.com/tensorflow/serving) using the
[`--service-kind=tfserving`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
option. Only the gRPC protocol is supported.

The following invocation demonstrates how to configure Perf Analyzer to issue
requests to a running instance of `tensorflow_model_server`:

```
$ perf_analyzer -m resnet50 --service-kind tfserving -i grpc -b 1 -p 5000 -u localhost:8500
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 829
Throughput: 165.8 infer/sec
Avg latency: 6032 usec (standard deviation 569 usec)
p50 latency: 5863 usec
p90 latency: 6655 usec
p95 latency: 6974 usec
p99 latency: 8093 usec
Avg gRPC time: 5984 usec ((un)marshal request/response 257 usec + response wait 5727 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 165.8 infer/sec, latency 6032 usec
```

You might need to specify a different URL ([`-u`](cli.md#-u-url)) depending on
where the server is running. The Perf Analyzer report will only include
statistics measured on the client side.

**NOTE:** TensorFlow Serving support is still in **beta**. Perf Analyzer does
not guarantee optimal tuning for TensorFlow Serving. However, a single
benchmarking tool that can stress different inference servers in an identical
manner is valuable for performance analysis.

The following points are important for interpreting the results:

1. `Concurrent Request Execution`:
TensorFlow Serving (TFS), as of version 2.8.0, by default creates a thread for
each request, and each thread individually submits its request to the
TensorFlow Session. There is a resource limit on the number of concurrent
threads serving requests. When benchmarking at a higher request concurrency,
you can see higher throughput because of this. Unlike TFS, Triton is by default
configured with only a single
[instance count](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups).
Hence, at a higher request concurrency, most of the requests are blocked on
instance availability. To configure Triton to behave like TFS, set the
instance count to a reasonably high value and then set the
[MAX_SESSION_SHARE_COUNT](https://github.com/triton-inference-server/tensorflow_backend#parameters)
parameter in the model `config.pbtxt` to the same value (see the sketch after
this list). For some context, TFS sets its thread constraint to four times the
number of schedulable CPUs.
2. `Different library versions`:
The version of TensorFlow might differ between Triton and TensorFlow Serving
being benchmarked. Even the versions of CUDA libraries might differ between
the two solutions. The performance of models can be susceptible to the
versions of these libraries. For a single request concurrency, if the
`compute_infer` time reported by Perf Analyzer when benchmarking Triton is as
large as the latency reported by Perf Analyzer when benchmarking TFS, then
the performance difference is likely because of the difference in the
software stack and outside the scope of Triton.
3. `CPU Optimization`:
TFS has separate builds for CPU and GPU targets, each with target-specific
optimizations. Unlike TFS, Triton has a single build, which is optimized for
execution on GPUs. When collecting performance numbers for CPU models on
Triton, try running Triton with the environment variable
`TF_ENABLE_ONEDNN_OPTS=1`.
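
For illustration only, the TFS-like Triton configuration described in point 1
could be sketched as the following addition to the model's `config.pbtxt`. The
repository path, model name, and the value `8` are assumptions, not tuned
recommendations.

```
# Hedged sketch: raise the instance count and set MAX_SESSION_SHARE_COUNT to
# the same value so Triton behaves more like TFS (values are illustrative)
$ cat >> /my/model/repository/resnet50/config.pbtxt <<'EOF'
instance_group [ { count: 8 } ]
parameters: {
  key: "MAX_SESSION_SHARE_COUNT"
  value: { string_value: "8" }
}
EOF
```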

# Benchmarking TorchServe

Perf Analyzer can also be used to benchmark
[TorchServe](https://github.com/pytorch/serve) using the
[`--service-kind=torchserve`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
option. Only the HTTP protocol is supported. It also requires input to be
provided via a JSON file.

The following invocation demonstrates how to configure Perf Analyzer to issue
requests to a running instance of `torchserve`, assuming the referenced
location holds `kitten_small.jpg`:

```
$ perf_analyzer -m resnet50 --service-kind torchserve -i http -u localhost:8080 -b 1 -p 5000 --input-data data.json
Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 799
Throughput: 159.8 infer/sec
Avg latency: 6259 usec (standard deviation 397 usec)
p50 latency: 6305 usec
p90 latency: 6448 usec
p95 latency: 6494 usec
p99 latency: 7158 usec
Avg HTTP time: 6272 usec (send/recv 77 usec + response wait 6195 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 159.8 infer/sec, latency 6259 usec
```

The content of `data.json`:

```json
{
"data" :
[
{
"TORCHSERVE_INPUT" : ["kitten_small.jpg"]
}
]
}
```

You might need to specify a different URL ([`-u`](cli.md#-u-url)) depending on
where the server is running. The Perf Analyzer report will only include
statistics measured on the client side.

**NOTE:** TorchServe support is still in **beta**. Perf Analyzer does not
guarantee optimal tuning for TorchServe. However, a single benchmarking tool
that can stress different inference servers in an identical manner is valuable
for performance analysis.

# Advantages of using Perf Analyzer over third-party benchmark suites

Triton Inference Server offers a complete serving solution, including
[client libraries](https://github.com/triton-inference-server/client) that are
optimized for Triton. Third-party benchmark suites like `jmeter` cannot take
advantage of these optimized libraries. Some of these optimizations include,
but are not limited to:

1. Using
[binary tensor data extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md#binary-tensor-data-extension)
with HTTP requests.
2. Effective re-use of gRPC message allocation in subsequent requests.
3. Avoiding extra memory copy via libcurl interface.

These optimizations can have a tremendous impact on overall performance.
Benchmarking with Perf Analyzer directly gives a user access to these
optimizations in their study.

In addition, Perf Analyzer is highly customizable and supports many Triton
features, as described in this document. This, along with its detailed reports,
allows a user to identify performance bottlenecks and experiment with different
features before deciding what works best for them.