Proof-read all docs
matthewkotila committed Apr 17, 2023
1 parent d2bdf00 commit 63e6df4
Showing 8 changed files with 450 additions and 394 deletions.
2 changes: 1 addition & 1 deletion src/c++/perf_analyzer/README.md
<!--
Copyright (c) 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
2 changes: 1 addition & 1 deletion src/c++/perf_analyzer/docs/README.md
<!--
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
204 changes: 121 additions & 83 deletions src/c++/perf_analyzer/docs/benchmarking.md
<!--
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
This is the default mode for Perf Analyzer.

# Benchmarking Triton directly via C API

Besides using HTTP or gRPC server endpoints to communicate with Triton, Perf
Analyzer also allows users to benchmark Triton directly using the C API. HTTP
and gRPC endpoints introduce additional latency in the pipeline which may not be
of interest to users who use Triton via the C API within their application.
Specifically, this feature is useful for benchmarking a bare-minimum Triton
without the additional overhead of HTTP/gRPC communication.

## Prerequisite

Pull the Triton SDK and Inference Server container images on the target
machine. Since you will need access to the `tritonserver` install, it might be
easier if you copy the `perf_analyzer` binary to the Inference Server container.
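
As a rough sketch, assuming release tag `23.03` and a server container named
`tritonserver` (both placeholders, substitute your own), the steps might look
like this:

```
# Pull the SDK (client) image and the Inference Server image
$ docker pull nvcr.io/nvidia/tritonserver:23.03-py3-sdk
$ docker pull nvcr.io/nvidia/tritonserver:23.03-py3

# Copy the perf_analyzer binary out of the SDK image
# (the binary path inside the SDK image is an assumption; adjust if needed)
$ docker create --name triton_sdk_container nvcr.io/nvidia/tritonserver:23.03-py3-sdk
$ docker cp triton_sdk_container:/usr/local/bin/perf_analyzer .
$ docker rm triton_sdk_container

# Copy the binary into the running Inference Server container
$ docker cp perf_analyzer tritonserver:/usr/local/bin/
```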

## Required parameters

Use the `--help` option to see a complete list of supported command line
arguments. By default, Perf Analyzer expects the Triton instance to already be
running. You can configure C API mode using the
[`--service-kind`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
option. In addition, you will need to point Perf Analyzer to the Triton server
library path using the
[`--triton-server-directory`](cli.md#--triton-server-directorypath) option and
the model repository path using the
[`--model-repository`](cli.md#--model-repositorypath) option.

An example run would look like:

```
$ perf_analyzer -m my_model --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/my_model_repository
...
*** Measurement Settings ***
  Service Kind: Triton C-API
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 353
    Throughput: 19.6095 infer/sec
    Avg latency: 50951 usec (standard deviation 2265 usec)
    p50 latency: 50833 usec
    p90 latency: 50923 usec
    p95 latency: 50940 usec
    p99 latency: 50985 usec
  Server:
    Inference count: 353
    Execution count: 353
    Successful request count: 353
    Avg request latency: 50841 usec (overhead 20 usec + queue 63 usec + compute input 35 usec + compute infer 50663 usec + compute output 59 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 19.6095 infer/sec, latency 50951 usec
```

## Non-supported functionalities

There are a few functionalities that are missing from C API mode. They are:

1. Async mode ([`--async`](cli.md#--async))
2. Shared memory mode
   ([`--shared-memory=cuda`](cli.md#--shared-memorynonesystemcuda) or
   [`--shared-memory=system`](cli.md#--shared-memorynonesystemcuda))
3. Request rate mode
   ([`--request-rate-range`](cli.md#--request-rate-rangestartendstep))
4. For additional known non-working cases, please refer to
   [qa/L0_perf_analyzer_capi/test.sh](https://github.com/triton-inference-server/server/blob/main/qa/L0_perf_analyzer_capi/test.sh#L239-L277)

# Benchmarking TensorFlow Serving

Perf Analyzer can also be used to benchmark models deployed on
[TensorFlow Serving](https://github.com/tensorflow/serving) using the
[`--service-kind=tfserving`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
option. Only the gRPC protocol is supported.

The following invocation demonstrates how to configure Perf Analyzer to issue
requests to a running instance of `tensorflow_model_server`:

```
$ perf_analyzer -m resnet50 --service-kind tfserving -i grpc -b 1 -p 5000 -u localhost:8500
...
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 165.8 infer/sec, latency 6032 usec
```

You might have to specify a different URL ([`-u`](cli.md#-u-url)) to access
wherever the server is running. The Perf Analyzer report will only include
statistics measured at the client side.

**NOTE:** The support is still in **beta**. Perf Analyzer does not guarantee
optimal tuning for TensorFlow Serving. However, a single benchmarking tool that
can be used to stress the inference servers in an identical manner is important
for performance analysis.

The following points are important for interpreting the results:

1. `Concurrent Request Execution`:
   TensorFlow Serving (TFS), as of version 2.8.0, by default creates threads for
   each request that individually submits requests to TensorFlow Session. There
   is a resource limit on the number of concurrent threads serving requests.
   When benchmarking at a higher request concurrency, you can see higher
   throughput because of this. Unlike TFS, by default Triton is configured with
   only a single
   [instance count](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups).
   Hence, at a higher request concurrency, most of the requests are blocked on
   instance availability. To configure Triton to behave like TFS, set the
   instance count to a reasonably high value and then set the
   [MAX_SESSION_SHARE_COUNT](https://github.com/triton-inference-server/tensorflow_backend#parameters)
   parameter in the model `config.pbtxt` to the same value (see the sketch after
   this list). For some context, TFS sets its thread constraint to four times
   the number of schedulable CPUs.
2. `Different library versions`:
   The version of TensorFlow might differ between Triton and TensorFlow Serving
   being benchmarked. Even the versions of CUDA libraries might differ between
   the two solutions. The performance of models can be susceptible to the
   versions of these libraries. For a single request concurrency, if the
   `compute_infer` time reported by Perf Analyzer when benchmarking Triton is as
   large as the latency reported by Perf Analyzer when benchmarking TFS, then
   the performance difference is likely because of the difference in the
   software stack and outside the scope of Triton.
3. `CPU Optimization`:
   TFS has separate builds for CPU and GPU targets. They have target-specific
   optimization. Unlike TFS, Triton has a single build which is optimized for
   execution on GPUs. When collecting performance on CPU models on Triton, try
   running Triton with the environment variable `TF_ENABLE_ONEDNN_OPTS=1`.
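
As a sketch of point 1 above, the `config.pbtxt` fragment below raises the
instance count and sets `MAX_SESSION_SHARE_COUNT` to the same value. The model
name, instance count, and kind are placeholder assumptions:

```
# Hypothetical model configuration fragment (config.pbtxt)
name: "my_tf_model"
platform: "tensorflow_savedmodel"

# Allow several execution instances so concurrent requests are not serialized
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]

# Let the instances share one TensorFlow session, similar to TFS threading
parameters: {
  key: "MAX_SESSION_SHARE_COUNT"
  value: { string_value: "4" }
}
```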

# Benchmarking TorchServe

Perf Analyzer can also be used to benchmark
[TorchServe](https://github.com/pytorch/serve) using the
[`--service-kind=torchserve`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
option. Only the HTTP protocol is supported. It also requires input to be
provided via a JSON file.

The following invocation demonstrates how to configure Perf Analyzer to issue
requests to a running instance of `torchserve`, assuming the location holds
`kitten_small.jpg`:

```
$ perf_analyzer -m resnet50 --service-kind torchserve -i http -u localhost:8080 -b 1 -p 5000 --input-data data.json
...
Concurrency: 1, throughput: 159.8 infer/sec, latency 6259 usec
```

The content of `data.json`:

```json
{
  "data" :
    [
      ...
    ]
}
```
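
As a hedged sketch only (the `TORCHSERVE_INPUT` key name is an assumption
here), a minimal `data.json` for this invocation might map an input key to the
image file name like this:

```json
{
  "data":
    [
      {
        "TORCHSERVE_INPUT": ["kitten_small.jpg"]
      }
    ]
}
```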

You might have to specify a different URL ([`-u`](cli.md#-u-url)) to access
wherever the server is running. The Perf Analyzer report will only include
statistics measured at the client side.

**NOTE:** The support is still in **beta**. Perf Analyzer does not guarantee
optimal tuning for TorchServe. However, a single benchmarking tool that can be
used to stress the inference servers in an identical manner is important for
performance analysis.

# Advantages of using Perf Analyzer over third-party benchmark suites

Triton Inference Server offers the entire serving solution, which includes
[client libraries](https://github.com/triton-inference-server/client) that are
optimized for Triton. Using third-party benchmark suites like `jmeter` fails to
take advantage of these optimized libraries. Some of these optimizations include
but are not limited to:

1. Using
   [binary tensor data extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md#binary-tensor-data-extension)
   with HTTP requests.
2. Effective re-use of gRPC message allocation in subsequent requests.
3. Avoiding an extra memory copy via the libcurl interface.

These optimizations can have a tremendous impact on overall performance. Using
Perf Analyzer for benchmarking directly allows a user to access these
optimizations in their study.

Not only that, Perf Analyzer is also very customizable and supports many Triton
features as described in this document. This, along with a detailed report,
allows a user to identify performance bottlenecks and experiment with different
features before deciding what works best for them.
23 changes: 11 additions & 12 deletions src/c++/perf_analyzer/docs/cli.md
<!--
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
Specifies the type of data that will be used for input in inference requests. The
available options are `zero`, `random`, and a path to a directory or a JSON
file.

When pointing to a JSON file, the user must adhere to the format described in
the [input data documentation](input_data.md). By specifying JSON data, users
can control the data used with every request. Multiple data streams can be
specified for a sequence model, and Perf Analyzer will select a data stream in a
round-robin fashion for every new sequence. Multiple JSON files can also be
provided (`--input-data json_file1.json --input-data json_file2.json` and so on)
and Perf Analyzer will append data streams from each file. When using
`--service-kind=torchserve`, make sure this option points to a JSON file.
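
For illustration, a minimal input-data file for a hypothetical model with a
single input tensor named `INPUT0` might look like the following (the input
name and values are placeholder assumptions):

```json
{
  "data": [
    { "INPUT0": [1, 2, 3, 4] },
    { "INPUT0": [5, 6, 7, 8] }
  ]
}
```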

If the option is a path to a directory, then the directory must contain a
binary/text file for each non-string/string input, respectively, named the same as the
Default is `false`.

#### `-H <string>`

Specifies the header that will be added to HTTP requests (ignored for gRPC
requests). The header must be specified as 'Header:Value'. `-H` may be
specified multiple times to add multiple headers.
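
For example, assuming a model named `my_model` and placeholder header values,
two custom headers can be added like this:

```
$ perf_analyzer -m my_model -i http -H "Authorization:Bearer my_token" -H "X-Trace-Id:12345"
```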

#### `--grpc-compression-algorithm=[none|gzip|deflate]`

Specifies the compression algorithm to be used by gRPC when sending requests.
Only supported when the gRPC protocol is being used.

Default is `none`.
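
For example, assuming a model named `my_model` (a placeholder), gzip
compression can be enabled for gRPC requests like this:

```
$ perf_analyzer -m my_model -i grpc --grpc-compression-algorithm=gzip
```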

2 changes: 1 addition & 1 deletion src/c++/perf_analyzer/docs/inference_load_modes.md
<!--
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions