diff --git a/src/c++/perf_analyzer/README.md b/src/c++/perf_analyzer/README.md
index 4bc9b3472..35fbf4720 100644
--- a/src/c++/perf_analyzer/README.md
+++ b/src/c++/perf_analyzer/README.md
@@ -1,752 +1,170 @@
-# Performance Analyzer
-
-A critical part of optimizing the inference performance of your model
-is being able to measure changes in performance as you experiment with
-different optimization strategies. The perf_analyzer application
-(previously known as perf_client) performs this task for the Triton
-Inference Server. The perf_analyzer is included with the client
-examples which are [available from several
-sources](https://github.com/triton-inference-server/client#getting-the-client-libraries-and-examples).
-
-The perf_analyzer application generates inference requests to your
-model and measures the throughput and latency of those requests. To
-get representative results, perf_analyzer measures the throughput and
-latency over a time window, and then repeats the measurements until it
-gets stable values. By default perf_analyzer uses average latency to
-determine stability but you can use the --percentile flag to stabilize
-results based on that confidence level. For example, if
---percentile=95 is used the results will be stabilized using the 95-th
-percentile request latency. For example,
+# Triton Performance Analyzer
-```
-$ perf_analyzer -m inception_graphdef --percentile=95
-*** Measurement Settings ***
-  Batch size: 1
-  Measurement window: 5000 msec
-  Using synchronous calls for inference
-  Stabilizing using p95 latency
-
-Request concurrency: 1
-  Client:
-    Request count: 348
-    Throughput: 69.6 infer/sec
-    p50 latency: 13936 usec
-    p90 latency: 18682 usec
-    p95 latency: 19673 usec
-    p99 latency: 21859 usec
-    Avg HTTP time: 14017 usec (send/recv 200 usec + response wait 13817 usec)
-  Server:
-    Inference count: 428
-    Execution count: 428
-    Successful request count: 428
-    Avg request latency: 12005 usec (overhead 36 usec + queue 42 usec + compute input 164 usec + compute infer 11748 usec + compute output 15 usec)
-
-Inferences/Second vs. Client p95 Batch Latency
-Concurrency: 1, throughput: 69.6 infer/sec, latency 19673 usec
-```
-
-## Request Concurrency
-
-By default perf_analyzer measures your model's latency and throughput
-using the lowest possible load on the model. To do this perf_analyzer
-sends one inference request to Triton and waits for the response.
-When that response is received, the perf_analyzer immediately sends
-another request, and then repeats this process during the measurement
-windows. The number of outstanding inference requests is referred to
-as the *request concurrency*, and so by default perf_analyzer uses a
-request concurrency of 1.
+Triton Performance Analyzer is a CLI tool which can help you optimize the
+inference performance of models running on Triton Inference Server by measuring
+changes in performance as you experiment with different optimization strategies.
-Using the --concurrency-range \:\:\ option you can have
-perf_analyzer collect data for a range of request concurrency
-levels. Use the --help option to see complete documentation for this
-and other options. For example, to see the latency and throughput of
-your model for request concurrency values from 1 to 4:
+
-```
-$ perf_analyzer -m inception_graphdef --concurrency-range 1:4
-*** Measurement Settings ***
-  Batch size: 1
-  Measurement window: 5000 msec
-  Latency limit: 0 msec
-  Concurrency limit: 4 concurrent requests
-  Using synchronous calls for inference
-  Stabilizing using average latency
-
-Request concurrency: 1
-  Client:
-    Request count: 339
-    Throughput: 67.8 infer/sec
-    Avg latency: 14710 usec (standard deviation 2539 usec)
-    p50 latency: 13665 usec
-...
-Request concurrency: 4
-  Client:
-    Request count: 415
-    Throughput: 83 infer/sec
-    Avg latency: 48064 usec (standard deviation 6412 usec)
-    p50 latency: 47975 usec
-    p90 latency: 56670 usec
-    p95 latency: 59118 usec
-    p99 latency: 63609 usec
-    Avg HTTP time: 48166 usec (send/recv 264 usec + response wait 47902 usec)
-  Server:
-    Inference count: 498
-    Execution count: 498
-    Successful request count: 498
-    Avg request latency: 45602 usec (overhead 39 usec + queue 33577 usec + compute input 217 usec + compute infer 11753 usec + compute output 16 usec)
-
-Inferences/Second vs. Client Average Batch Latency
-Concurrency: 1, throughput: 67.8 infer/sec, latency 14710 usec
-Concurrency: 2, throughput: 89.8 infer/sec, latency 22280 usec
-Concurrency: 3, throughput: 80.4 infer/sec, latency 37283 usec
-Concurrency: 4, throughput: 83 infer/sec, latency 48064 usec
-```
+# Features
-## Understanding The Output
+### Inference Load Modes
-### How Throughput is Calculated
+- [Concurrency Mode](docs/inference_load_modes.md#concurrency-mode) simulates
+  load by maintaining a specific concurrency of outgoing requests to the
+  server
-Perf Analyzer calculates throughput to be the total number of requests completed
-during a measurement, divided by the duration of the measurement, in seconds.
+- [Request Rate Mode](docs/inference_load_modes.md#request-rate-mode) simulates
+  load by sending consecutive requests at a specific rate to the server
-### How Latency is Calculated
+- [Custom Interval Mode](docs/inference_load_modes.md#custom-interval-mode)
+  simulates load by sending consecutive requests at specific intervals to the
+  server
-For each request concurrency level perf_analyzer reports latency and
-throughput as seen from the *client* (that is, as seen by
-perf_analyzer) and also the average request latency on the server.
+### Performance Measurement Modes
-The server latency measures the total time from when the request is
-received at the server until the response is sent from the
-server. Because of the HTTP and GRPC libraries used to implement the
-server endpoints, total server latency is typically more accurate for
-HTTP requests as it measures time from first byte received until last
-byte sent. For both HTTP and GRPC the total server latency is
-broken-down into the following components:
+- [Time Windows Mode](docs/measurements_metrics.md#time-windows) measures model
+  performance repeatedly over a specific time interval until performance has
+  stabilized
-- *queue*: The average time spent in the inference schedule queue by a
-  request waiting for an instance of the model to become available.
-- *compute*: The average time spent performing the actual inference,
-  including any time needed to copy data to/from the GPU.
+- [Count Windows Mode](docs/measurements_metrics.md#count-windows) measures + model performance repeatedly over a specific number of requests until + performance has stabilized -The client latency time is broken-down further for HTTP and GRPC as -follows: +### Other Features -- HTTP: *send/recv* indicates the time on the client spent sending the - request and receiving the response. *response wait* indicates time - waiting for the response from the server. -- GRPC: *(un)marshal request/response* indicates the time spent - marshalling the request data into the GRPC protobuf and - unmarshalling the response data from the GRPC protobuf. *response - wait* indicates time writing the GRPC request to the network, - waiting for the response, and reading the GRPC response from the - network. +- [Sequence Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#stateful-models) + and + [Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) + can be profiled in addition to standard/stateless models -Use the verbose (-v) option to perf_analyzer to see more output, -including the stabilization passes run for each request concurrency -level. +- [Input Data](docs/input_data.md) to model inferences can be auto-generated or + specified as well as verifying output -## Measurement Modes +- [TensorFlow Serving](docs/benchmarking.md#benchmarking-tensorflow-serving) and + [TorchServe](docs/benchmarking.md#benchmarking-torchserve) can be used as the + inference server in addition to the default Triton server -### Time Windows +
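As a quick illustration, each inference load mode corresponds to a CLI flag described in the [Perf Analyzer CLI](docs/cli.md) reference; the model name and values below are placeholders, not recommendations:

```bash
# Concurrency mode: keep 1 to 4 requests outstanding at a time
perf_analyzer -m <model_name> --concurrency-range 1:4

# Request rate mode: send requests at a fixed rate (requests per second)
perf_analyzer -m <model_name> --request-rate-range 100

# Custom interval mode: read per-request intervals (in microseconds) from a file
perf_analyzer -m <model_name> --request-intervals /path/to/intervals.txt
```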
-When using time windows measurement mode (`--measurement-mode=time_windows`), -Perf Analyzer will count how many requests have completed during a window of -duration `X` (in milliseconds, via `--measurement-interval=X`, default is -`5000`). This is the default measurement mode. +# Quick Start -### Count Windows +The steps below will guide you on how to start using Perf Analyzer. -When using count windows measurement mode (`--measurement-mode=count_windows`), -Perf Analyzer will start the window duration at 1 second and potentially -dynamically increase it until `X` requests have completed (via -`--measurement-request-count=X`, default is `50`). +### Step 1: Start Triton Container -## Visualizing Latency vs. Throughput +```bash +export RELEASE= # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02` -The perf_analyzer provides the -f option to generate a file containing -CSV output of the results. +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3 +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3 ``` -$ perf_analyzer -m inception_graphdef --concurrency-range 1:4 -f perf.csv -$ cat perf.csv -Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency -1,69.2,225,2148,64,206,11781,19,0,13891,18795,19753,21018 -3,84.2,237,1768,21673,209,11742,17,0,35398,43984,47085,51701 -4,84.2,279,1604,33669,233,11731,18,1,47045,56545,59225,64886 -2,87.2,235,1973,9151,190,11346,17,0,21874,28557,29768,34766 -``` - -NOTE: The rows in the CSV file are sorted in an increasing order of throughput (Inferences/Second). - -You can import the CSV file into a spreadsheet to help visualize -the latency vs inferences/second tradeoff as well as see some -components of the latency. Follow these steps: - -- Open [this - spreadsheet](https://docs.google.com/spreadsheets/d/1S8h0bWBBElHUoLd2SOvQPzZzRiQ55xjyqodm_9ireiw) -- Make a copy from the File menu "Make a copy..." -- Open the copy -- Select the A1 cell on the "Raw Data" tab -- From the File menu select "Import..." -- Select "Upload" and upload the file -- Select "Replace data at selected cell" and then select the "Import data" button - -### Server-side Prometheus metrics - -Perf Analyzer can collect -[server-side metrics](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md#gpu-metrics) -, such as GPU utilization and GPU power usage. To enable the collection of these metrics, -use the `--collect-metrics` CLI option. - -Perf Analyzer defaults to access the metrics endpoint at -`localhost:8002/metrics`. If the metrics are accessible at a different url, use -the `--metrics-url ` CLI option to specify that. - -Perf Analyzer defaults to access the metrics endpoint every 1000 milliseconds. -To use a different accessing interval, use the `--metrics-interval ` -CLI option (specify in milliseconds). - -Because Perf Analyzer can collect the server-side metrics multiple times per -run, these metrics are aggregated in specific ways to produce one final number -per sweep (concurrency/request rate). Here are how they are aggregated: -| Metric | Aggregation | -|--------|-------------| -| GPU Utilization | Averaged from each collection taken during stable passes. We want a number representative of all stable passes. | -| GPU Power Usage | Averaged from each collection taken during stable passes. 
We want a number representative of all stable passes. | -| GPU Used Memory | Maximum from all collections taken during a stable pass. Users are typically curious what the peak memory usage is for determining model/hardware viability. | -| GPU Total Memory | First from any collection taken during a stable pass. All of the collections should produce the same value for total memory available on the GPU. | - -Note that all metrics are per-GPU in the case of multi-GPU systems. - -To output these server-side metrics to a CSV file, use the `-f ` and -`--verbose-csv` CLI options. The output CSV will contain one column per metric. -The value of each column will be a `key:value` pair (`GPU UUID:metric value`). -Each `key:value` pair will be delimited by a semicolon (`;`) to indicate metric -values for each GPU accessible by the server. There is a trailing semicolon. See -below: - -`:;:;...;` - -Here is a simplified CSV output: +### Step 2: Download `simple` Model ```bash -$ perf_analyzer -m resnet50_libtorch --collect-metrics -f output.csv --verbose-csv -$ cat output.csv -Concurrency,...,Avg GPU Utilization,Avg GPU Power Usage,Max GPU Memory Usage,Total GPU Memory -1,...,gpu_uuid_0:0.33;gpu_uuid_1:0.5;,gpu_uuid_0:55.3;gpu_uuid_1:56.9;,gpu_uuid_0:10000;gpu_uuid_1:11000;,gpu_uuid_0:50000;gpu_uuid_1:75000;, -2,...,gpu_uuid_0:0.25;gpu_uuid_1:0.6;,gpu_uuid_0:25.6;gpu_uuid_1:77.2;,gpu_uuid_0:11000;gpu_uuid_1:17000;,gpu_uuid_0:50000;gpu_uuid_1:75000;, -3,...,gpu_uuid_0:0.87;gpu_uuid_1:0.9;,gpu_uuid_0:87.1;gpu_uuid_1:71.7;,gpu_uuid_0:15000;gpu_uuid_1:22000;,gpu_uuid_0:50000;gpu_uuid_1:75000;, -``` - -## Input Data - -Use the --help option to see complete documentation for all input -data options. By default perf_analyzer sends random data to all the -inputs of your model. You can select a different input data mode with -the --input-data option: - -- *random*: (default) Send random data for each input. -- *zero*: Send zeros for each input. -- directory path: A path to a directory containing a binary file for each input, named the same as the input. Each binary file must contain the data required for that input for a batch-1 request. Each file should contain the raw binary representation of the input in row-major order. -- file path: A path to a JSON file containing data to be used with every inference request. See the "Real Input Data" section for further details. --input-data can be provided multiple times with different file paths to specific multiple JSON files. - -For tensors with with STRING/BYTES datatype there are additional -options --string-length and --string-data that may be used in some -cases (see --help for full documentation). - -For models that support batching you can use the -b option to indicate -the batch-size of the requests that perf_analyzer should send. For -models with variable-sized inputs you must provide the --shape -argument so that perf_analyzer knows what shape tensors to use. For -example, for a model that has an input called *IMAGE* that has shape [ -3, N, M ], where N and M are variable-size dimensions, to tell -perf_analyzer to send batch-size 4 requests of shape [ 3, 224, 224 ]: +# inside triton container +git clone --depth 1 https://github.com/triton-inference-server/server +mkdir model_repository ; cp -r server/docs/examples/model_repository/simple model_repository ``` -$ perf_analyzer -m mymodel -b 4 --shape IMAGE:3,224,224 -``` - -## Real Input Data -The performance of some models is highly dependent on the data used. 
-For such cases you can provide data to be used with every inference -request made by analyzer in a JSON file. The perf_analyzer will use -the provided data in a round-robin order when sending inference -requests. For sequence models, if a sequence length is specified via -`--sequence-length`, perf_analyzer will also loop through the provided data in a -round-robin order up to the specified sequence length (with a percentage -variation customizable via `--sequence-length-variation`). Otherwise, the -sequence length will be the number of inputs specified in user-provided input -data. - -Each entry in the "data" array must specify all input tensors with the -exact size expected by the model from a single batch. The following -example describes data for a model with inputs named, INPUT0 and -INPUT1, shape [4, 4] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - ... - ] - } -``` +### Step 3: Start Triton Server -Note that the [4, 4] tensor has been flattened in a row-major format -for the inputs. In addition to specifying explicit tensors, you can -also provide Base64 encoded binary data for the tensors. Each data -object must list its data in a row-major order. Binary data must be in -little-endian byte order. The following example highlights how this -can be acheived: - -``` - { - "data" : - [ - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - ... - ] - } -``` +```bash +# inside triton container +tritonserver --model-repository $(pwd)/model_repository &> server.log & -In case of sequence models, multiple data streams can be specified in -the JSON file. Each sequence will get a data stream of its own and the -analyzer will ensure the data from each stream is played back to the -same correlation id. The below example highlights how to specify data -for multiple streams for a sequence model with a single input named -INPUT, shape [1] and data type STRING: +# confirm server is ready, look for 'HTTP/1.1 200 OK' +curl -v localhost:8000/v2/health/ready +# detatch (CTRL-p CTRL-q) ``` - { - "data" : - [ - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["2"] - }, - { - "INPUT" : ["3"] - }, - { - "INPUT" : ["4"] - } - ], - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - } - ], - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - } - ] - ] - } -``` - -The above example describes three data streams with lengths 4, 3 and 2 -respectively. The perf_analyzer will hence produce sequences of -length 4, 3 and 2 in this case. - -You can also provide an optional "shape" field to the tensors. This is -especially useful while profiling the models with variable-sized -tensors as input. 
Additionally note that when providing the "shape" field, -tensor contents must be provided separately in "content" field in row-major -order. The specified shape values will override default input shapes -provided as a command line option (see --shape) for variable-sized inputs. -In the absence of "shape" field, the provided defaults will be used. There -is no need to specify shape as a command line option if all the data steps -provide shape values for variable tensors. Below is an example json file -for a model with single input "INPUT", shape [-1,-1] and data type INT32: -``` - { - "data" : - [ - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [2,8] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [8,2] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [4,4] - } - } - ... - ] - } -``` +### Step 4: Start Triton SDK Container -The following is the example to provide contents as base64 string with explicit shapes: +```bash +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk ``` -{ - "data": [{ - "INPUT": { - "content": {"b64": "/9j/4AAQSkZ(...)"}, - "shape": [7964] - }}, - (...)] -} -``` - -Note that for STRING type an element is represented by a 4-byte unsigned integer giving -the length followed by the actual bytes. The byte array to be encoded using base64 must -include the 4-byte unsigned integers. - -### Output Validation -When real input data is provided, it is optional to request perf analyzer to -validate the inference output for the input data. +### Step 5: Run Perf Analyzer -Validation output can be specified in "validation_data" field in the same format -as "data" field for real input. Note that the entries in "validation_data" must -align with "data" for proper mapping. The following example describes validation -data for a model with inputs named, INPUT0 and INPUT1, outputs named, OUTPUT0 -and OUTPUT1, all tensors have shape [4, 4] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - ... - ], - "validation_data" : - [ - { - "OUTPUT0" : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - "OUTPUT1" : [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] - } - ... - ] - } +```bash +# inside sdk container +perf_analyzer -m simple ``` -Besides the above example, the validation outputs can be specified in the same -variations described in "real input data" section. - -## Shared Memory +See the full [quick start guide](docs/quick_start.md) for additional tips on +how to analyze output. -By default perf_analyzer sends input tensor data and receives output -tensor data over the network. You can instead instruct perf_analyzer to -use system shared memory or CUDA shared memory to communicate tensor -data. By using these options you can model the performance that you -can achieve by using shared memory in your application. Use ---shared-memory=system to use system (CPU) shared memory or ---shared-memory=cuda to use CUDA shared memory. +
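Once the `simple` model is being served, the same SDK container can also run a small concurrency sweep and write the results to a CSV file; the flags are covered in the CLI documentation and the values here are only illustrative:

```bash
# inside sdk container
# sweep request concurrency from 1 to 4 and save per-concurrency results
perf_analyzer -m simple --concurrency-range 1:4 -f perf.csv
```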
-## Communication Protocol +# Documentation -By default perf_analyzer uses HTTP to communicate with Triton. The GRPC -protocol can be specificed with the -i option. If GRPC is selected the ---streaming option can also be specified for GRPC streaming. +- [Installation](docs/install.md) +- [Perf Analyzer CLI](docs/cli.md) +- [Inference Load Modes](docs/inference_load_modes.md) +- [Input Data](docs/input_data.md) +- [Measurements & Metrics](docs/measurements_metrics.md) +- [Benchmarking](docs/benchmarking.md) -### SSL/TLS Support +
-perf_analyzer can be used to benchmark Triton service behind SSL/TLS-enabled endpoints. These options can help in establishing secure connection with the endpoint and profile the server. +# Contributing -For gRPC, see the following options: +Contributions to Triton Perf Analyzer are more than welcome. To contribute +please review the [contribution +guidelines](https://github.com/triton-inference-server/server/blob/main/CONTRIBUTING.md), +then fork and create a pull request. -* `--ssl-grpc-use-ssl` -* `--ssl-grpc-root-certifications-file` -* `--ssl-grpc-private-key-file` -* `--ssl-grpc-certificate-chain-file` +
-More details here: https://grpc.github.io/grpc/cpp/structgrpc_1_1_ssl_credentials_options.html +# Reporting problems, asking questions -The -[inference protocol gRPC SSL/TLS section](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md#ssltls) -describes server-side options to configure SSL/TLS in Triton's gRPC endpoint. +We appreciate any feedback, questions or bug reporting regarding this +project. When help with code is needed, follow the process outlined in +the Stack Overflow (https://stackoverflow.com/help/mcve) +document. Ensure posted examples are: -For HTTPS, the following options are exposed: +- minimal - use as little code as possible that still produces the + same problem -* `--ssl-https-verify-peer` -* `--ssl-https-verify-host` -* `--ssl-https-ca-certificates-file` -* `--ssl-https-client-certificate-file` -* `--ssl-https-client-certificate-type` -* `--ssl-https-private-key-file` -* `--ssl-https-private-key-type` +- complete - provide all parts needed to reproduce the problem. Check + if you can strip external dependency and still show the problem. The + less time we spend on reproducing problems the more time we have to + fix it -See `--help` for full documentation. - -Unlike gRPC, Triton's HTTP server endpoint can not be configured with SSL/TLS support. - -Note: Just providing these `--ssl-http-*` options to perf_analyzer does not ensure the SSL/TLS is used in communication. If SSL/TLS is not enabled on the service endpoint, these options have no effect. The intent of exposing these options to a user of perf_analyzer is to allow them to configure perf_analyzer to benchmark Triton service behind SSL/TLS-enabled endpoints. In other words, if Triton is running behind a HTTPS server proxy, then these options would allow perf_analyzer to profile Triton via exposed HTTPS proxy. - -## Benchmarking Triton directly via C API - -Besides using HTTP or gRPC server endpoints to communicate with Triton, perf_analyzer also allows user to benchmark Triton directly using C API. HTTP/gRPC endpoints introduce an additional latency in the pipeline which may not be of interest to the user who is using Triton via C API within their application. Specifically, this feature is useful to benchmark bare minimum Triton without additional overheads from HTTP/gRPC communication. - -### Prerequisite -Pull the Triton SDK and the Inference Server container images on target machine. -Since you will need access to the Tritonserver install, it might be easier if -you copy the perf_analyzer binary to the Inference Server container. - -### Required Parameters -Use the --help option to see complete list of supported command line arguments. -By default perf_analyzer expects the Triton instance to already be running. You can configure the C API mode using the `--service-kind` option. In additon, you will need to point -perf_analyzer to the Triton server library path using the `--triton-server-directory` option and the model -repository path using the `--model-repository` option. -If the server is run successfully, there is a prompt: "server is alive!" and perf_analyzer will print the stats, as normal. -An example run would look like: -``` -perf_analyzer -m graphdef_int32_int32_int32 --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/workspace/qa/L0_perf_analyzer_capi/models -``` - -### Non-supported functionalities -There are a few functionalities that are missing from the C API. They are: -1. Async mode (`-a`) -2. 
Using shared memory mode (`--shared-memory=cuda` or `--shared-memory=system`) -3. Request rate range mode -4. For additonal known non-working cases, please refer to - [qa/L0_perf_analyzer_capi/test.sh](https://github.com/triton-inference-server/server/blob/main/qa/L0_perf_analyzer_capi/test.sh#L239-L277) - - -## Benchmarking TensorFlow Serving -perf_analyzer can also be used to benchmark models deployed on -[TensorFlow Serving](https://github.com/tensorflow/serving) using -the `--service-kind` option. The support is however only available -through gRPC protocol. - -Following invocation demonstrates how to configure perf_analyzer -to issue requests to a running instance of -`tensorflow_model_server`: - -``` -$ perf_analyzer -m resnet50 --service-kind tfserving -i grpc -b 1 -p 5000 -u localhost:8500 -*** Measurement Settings *** - Batch size: 1 - Using "time_windows" mode for stabilization - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using average latency -Request concurrency: 1 - Client: - Request count: 829 - Throughput: 165.8 infer/sec - Avg latency: 6032 usec (standard deviation 569 usec) - p50 latency: 5863 usec - p90 latency: 6655 usec - p95 latency: 6974 usec - p99 latency: 8093 usec - Avg gRPC time: 5984 usec ((un)marshal request/response 257 usec + response wait 5727 usec) -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 165.8 infer/sec, latency 6032 usec -``` - -You might have to specify a different url(`-u`) to access wherever -the server is running. The report of perf_analyzer will only -include statistics measured at the client-side. - -**NOTE:** The support is still in **beta**. perf_analyzer does -not guarantee optimum tuning for TensorFlow Serving. However, a -single benchmarking tool that can be used to stress the inference -servers in an identical manner is important for performance -analysis. - - -The following points are important for interpreting the results: -1. `Concurrent Request Execution`: -TensorFlow Serving (TFS), as of version 2.8.0, by default creates -threads for each request that individually submits requests to -TensorFlow Session. There is a resource limit on the number of -concurrent threads serving requests. When benchmarking at a higher -request concurrency, you can see higher throughput because of this. -Unlike TFS, by default Triton is configured with only a single -[instance count](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups) -. Hence, at a higher request concurrency, most -of the requests are blocked on the instance availability. To -configure Triton to behave like TFS, set the instance count to a -reasonably high value and then set -[MAX_SESSION_SHARE_COUNT](https://github.com/triton-inference-server/tensorflow_backend#parameters) -parameter in the model confib.pbtxt to the same value.For some -context, the TFS sets its thread constraint to four times the -num of schedulable CPUs. -2. `Different library versions`: -The version of TensorFlow might differ between Triton and -TensorFlow Serving being benchmarked. Even the versions of cuda -libraries might differ between the two solutions. The performance -of models can be susceptible to the versions of these libraries. 
-For a single request concurrency, if the compute_infer time -reported by perf_analyzer when benchmarking Triton is as large as -the latency reported by perf_analyzer when benchmarking TFS, then -the performance difference is likely because of the difference in -the software stack and outside the scope of Triton. -3. `CPU Optimization`: -TFS has separate builds for CPU and GPU targets. They have -target-specific optimization. Unlike TFS, Triton has a single build -which is optimized for execution on GPUs. When collecting performance -on CPU models on Triton, try running Triton with the environment -variable `TF_ENABLE_ONEDNN_OPTS=1`. - - -## Benchmarking TorchServe -perf_analyzer can also be used to benchmark -[TorchServe](https://github.com/pytorch/serve) using the -`--service-kind` option. The support is however only available through -HTTP protocol. It also requires input to be provided via JSON file. - -Following invocation demonstrates how to configure perf_analyzer to -issue requests to a running instance of `torchserve` assuming the -location holds `kitten_small.jpg`: - -``` -$ perf_analyzer -m resnet50 --service-kind torchserve -i http -u localhost:8080 -b 1 -p 5000 --input-data data.json - Successfully read data for 1 stream/streams with 1 step/steps. -*** Measurement Settings *** - Batch size: 1 - Using "time_windows" mode for stabilization - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using average latency -Request concurrency: 1 - Client: - Request count: 799 - Throughput: 159.8 infer/sec - Avg latency: 6259 usec (standard deviation 397 usec) - p50 latency: 6305 usec - p90 latency: 6448 usec - p95 latency: 6494 usec - p99 latency: 7158 usec - Avg HTTP time: 6272 usec (send/recv 77 usec + response wait 6195 usec) -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 159.8 infer/sec, latency 6259 usec -``` - -The content of `data.json`: - -``` - { - "data" : - [ - { - "TORCHSERVE_INPUT" : ["kitten_small.jpg"] - } - ] - } -``` - -You might have to specify a different url(`-u`) to access wherever -the server is running. The report of perf_analyzer will only include -statistics measured at the client-side. - -**NOTE:** The support is still in **beta**. perf_analyzer does not -guarantee optimum tuning for TorchServe. However, a single benchmarking -tool that can be used to stress the inference servers in an identical -manner is important for performance analysis. - -## Advantages of using Perf Analyzer over third-party benchmark suites - -Triton Inference Server offers the entire serving solution which -includes [client libraries](https://github.com/triton-inference-server/client) -that are optimized for Triton. -Using third-party benchmark suites like jmeter fails to take advantage of the -optimized libraries. Some of these optimizations includes but are not limited -to: -1. Using -[binary tensor data extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) -with HTTP requests. -2. Effective re-use of gRPC message allocation in subsequent requests. -3. Avoiding extra memory copy via libcurl interface. - -These optimizations can have a tremendous impact on overall performance. -Using perf_analyzer for benchmarking directly allows a user to access -these optimizations in their study. - -Not only that, perf_analyzer is also very customizable and supports many -Triton features as described in this document. 
This, along with a detailed -report, allows a user to identify performance bottlenecks and experiment -with different features before deciding upon what works best for them. +- verifiable - test the code you're about to provide to make sure it + reproduces the problem. Remove all other problems that are not + related to your request/question. diff --git a/src/c++/perf_analyzer/command_line_parser.cc b/src/c++/perf_analyzer/command_line_parser.cc index 5ca4fc9f6..4a226c1e1 100644 --- a/src/c++/perf_analyzer/command_line_parser.cc +++ b/src/c++/perf_analyzer/command_line_parser.cc @@ -168,8 +168,8 @@ CLParser::Usage(const std::string& msg) "{\"data\" : [{\"TORCHSERVE_INPUT\" : [\"\"]}, {...}...]}. The type of file here will depend " "on the model. In order to use \"triton_c_api\" you must specify " - "the Triton server install path and the model repository " - "path via the --library-name and --model-repo flags", + "the Triton server install path and the model repository path via " + "the --triton-server-directory and --model-repository flags", 18) << std::endl; diff --git a/src/c++/perf_analyzer/docs/README.md b/src/c++/perf_analyzer/docs/README.md new file mode 100644 index 000000000..485c4207f --- /dev/null +++ b/src/c++/perf_analyzer/docs/README.md @@ -0,0 +1,54 @@ + + +# **Perf Analyzer Documentation** + +| [Installation](README.md#installation) | [Getting Started](README.md#getting-started) | [User Guide](README.md#user-guide) | +| -------------------------------------- | -------------------------------------------- | ---------------------------------- | + +## **Installation** + +See the [Installation Guide](install.md) for details on how to install Perf +Analyzer. + +## **Getting Started** + +The [Quick Start Guide](quick_start.md) will show you how to use Perf +Analyzer to profile a simple PyTorch model. + +## **User Guide** + +The User Guide describes the Perf Analyzer command line options, how to specify +model input data, the performance measurement modes, the performance metrics and +outputs, how to benchmark different servers, and more. + +- [Perf Analyzer CLI](cli.md) +- [Inference Load Modes](inference_load_modes.md) +- [Input Data](input_data.md) +- [Measurements & Metrics](measurements_metrics.md) +- [Benchmarking](benchmarking.md) diff --git a/src/c++/perf_analyzer/docs/benchmarking.md b/src/c++/perf_analyzer/docs/benchmarking.md new file mode 100644 index 000000000..d900ff852 --- /dev/null +++ b/src/c++/perf_analyzer/docs/benchmarking.md @@ -0,0 +1,250 @@ + + +# Benchmarking Triton via HTTP or gRPC endpoint + +This is the default mode for Perf Analyzer. + +# Benchmarking Triton directly via C API + +Besides using HTTP or gRPC server endpoints to communicate with Triton, Perf +Analyzer also allows users to benchmark Triton directly using the C API. HTTP +and gRPC endpoints introduce an additional latency in the pipeline which may not +be of interest to users who are using Triton via C API within their application. +Specifically, this feature is useful to benchmark a bare minimum Triton without +additional overheads from HTTP/gRPC communication. + +## Prerequisite + +Pull the Triton SDK and the Triton Server container images on target machine. +Since you will need access to the `tritonserver` install, it might be easier if +you copy the `perf_analyzer` binary to the Inference Server container. + +## Required parameters + +Use the `--help` option to see a complete list of supported command line +arguments. 
By default, Perf Analyzer expects the Triton instance to already be
+running. You can configure C API mode using the
+[`--service-kind`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
+option. In addition, you will need to point Perf Analyzer to the Triton server
+library path using the
+[`--triton-server-directory`](cli.md#--triton-server-directorypath) option and
+the model repository path using the
+[`--model-repository`](cli.md#--model-repositorypath) option.
+
+An example run would look like:
+
+```
+$ perf_analyzer -m my_model --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/my/model/repository
+...
+*** Measurement Settings ***
+  Service Kind: Triton C-API
+  Using "time_windows" mode for stabilization
+  Measurement window: 5000 msec
+  Using synchronous calls for inference
+  Stabilizing using average latency
+
+Request concurrency: 1
+  Client:
+    Request count: 353
+    Throughput: 19.6095 infer/sec
+    Avg latency: 50951 usec (standard deviation 2265 usec)
+    p50 latency: 50833 usec
+    p90 latency: 50923 usec
+    p95 latency: 50940 usec
+    p99 latency: 50985 usec
+
+  Server:
+    Inference count: 353
+    Execution count: 353
+    Successful request count: 353
+    Avg request latency: 50841 usec (overhead 20 usec + queue 63 usec + compute input 35 usec + compute infer 50663 usec + compute output 59 usec)
+
+Inferences/Second vs. Client Average Batch Latency
+Concurrency: 1, throughput: 19.6095 infer/sec, latency 50951 usec
+```
+
+## Non-supported functionalities
+
+There are a few functionalities that are missing from C API mode. They are:
+
+1. Async mode ([`--async`](cli.md#--async))
+2. For additional known non-working cases, please refer to
+   [qa/L0_perf_analyzer_capi/test.sh](https://github.com/triton-inference-server/server/blob/main/qa/L0_perf_analyzer_capi/test.sh#L239-L277)
+
+# Benchmarking TensorFlow Serving
+
+Perf Analyzer can also be used to benchmark models deployed on
+[TensorFlow Serving](https://github.com/tensorflow/serving) using the
+[`--service-kind=tfserving`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
+option. Only gRPC protocol is supported.
+
+The following invocation demonstrates how to configure Perf Analyzer to issue
+requests to a running instance of `tensorflow_model_server`:
+
+```
+$ perf_analyzer -m resnet50 --service-kind tfserving -i grpc -b 1 -p 5000 -u localhost:8500
+*** Measurement Settings ***
+  Batch size: 1
+  Using "time_windows" mode for stabilization
+  Measurement window: 5000 msec
+  Using synchronous calls for inference
+  Stabilizing using average latency
+Request concurrency: 1
+  Client:
+    Request count: 829
+    Throughput: 165.8 infer/sec
+    Avg latency: 6032 usec (standard deviation 569 usec)
+    p50 latency: 5863 usec
+    p90 latency: 6655 usec
+    p95 latency: 6974 usec
+    p99 latency: 8093 usec
+    Avg gRPC time: 5984 usec ((un)marshal request/response 257 usec + response wait 5727 usec)
+Inferences/Second vs. Client Average Batch Latency
+Concurrency: 1, throughput: 165.8 infer/sec, latency 6032 usec
+```
+
+You might have to specify a different URL ([`-u`](cli.md#-u-url)) to access
+wherever the server is running. The report of Perf Analyzer will only include
+statistics measured at the client-side.
+
+**NOTE:** The support is still in **beta**. Perf Analyzer does not guarantee
+optimal tuning for TensorFlow Serving. However, a single benchmarking tool that
+can be used to stress the inference servers in an identical manner is important
+for performance analysis.
+ +The following points are important for interpreting the results: + +1. `Concurrent Request Execution`: + TensorFlow Serving (TFS), as of version 2.8.0, by default creates threads for + each request that individually submits requests to TensorFlow Session. There + is a resource limit on the number of concurrent threads serving requests. + When benchmarking at a higher request concurrency, you can see higher + throughput because of this. Unlike TFS, by default Triton is configured with + only a single + [instance count](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups). + Hence, at a higher request concurrency, most of the requests are blocked on + the instance availability. To configure Triton to behave like TFS, set the + instance count to a reasonably high value and then set + [MAX_SESSION_SHARE_COUNT](https://github.com/triton-inference-server/tensorflow_backend#parameters) + parameter in the model `config.pbtxt` to the same value. For some context, + the TFS sets its thread constraint to four times the num of schedulable CPUs. +2. `Different library versions`: + The version of TensorFlow might differ between Triton and TensorFlow Serving + being benchmarked. Even the versions of CUDA libraries might differ between + the two solutions. The performance of models can be susceptible to the + versions of these libraries. For a single request concurrency, if the + `compute_infer` time reported by Perf Analyzer when benchmarking Triton is as + large as the latency reported by Perf Analyzer when benchmarking TFS, then + the performance difference is likely because of the difference in the + software stack and outside the scope of Triton. +3. `CPU Optimization`: + TFS has separate builds for CPU and GPU targets. They have target-specific + optimization. Unlike TFS, Triton has a single build which is optimized for + execution on GPUs. When collecting performance on CPU models on Triton, try + running Triton with the environment variable `TF_ENABLE_ONEDNN_OPTS=1`. + +# Benchmarking TorchServe + +Perf Analyzer can also be used to benchmark +[TorchServe](https://github.com/pytorch/serve) using the +[`--service-kind=torchserve`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve) +option. Only HTTP protocol is supported. It also requires input to be provided +via JSON file. + +The following invocation demonstrates how to configure Perf Analyzer to issue +requests to a running instance of `torchserve` assuming the location holds +`kitten_small.jpg`: + +``` +$ perf_analyzer -m resnet50 --service-kind torchserve -i http -u localhost:8080 -b 1 -p 5000 --input-data data.json + Successfully read data for 1 stream/streams with 1 step/steps. +*** Measurement Settings *** + Batch size: 1 + Using "time_windows" mode for stabilization + Measurement window: 5000 msec + Using synchronous calls for inference + Stabilizing using average latency +Request concurrency: 1 + Client: + Request count: 799 + Throughput: 159.8 infer/sec + Avg latency: 6259 usec (standard deviation 397 usec) + p50 latency: 6305 usec + p90 latency: 6448 usec + p95 latency: 6494 usec + p99 latency: 7158 usec + Avg HTTP time: 6272 usec (send/recv 77 usec + response wait 6195 usec) +Inferences/Second vs. 
Client Average Batch Latency +Concurrency: 1, throughput: 159.8 infer/sec, latency 6259 usec +``` + +The content of `data.json`: + +```json + { + "data" : + [ + { + "TORCHSERVE_INPUT" : ["kitten_small.jpg"] + } + ] + } +``` + +You might have to specify a different url ([`-u`](cli.md#-u-url)) to access +wherever the server is running. The report of Perf Analyzer will only include +statistics measured at the client-side. + +**NOTE:** The support is still in **beta**. Perf Analyzer does not guarantee +optimal tuning for TorchServe. However, a single benchmarking tool that can be +used to stress the inference servers in an identical manner is important for +performance analysis. + +# Advantages of using Perf Analyzer over third-party benchmark suites + +Triton Inference Server offers the entire serving solution which includes +[client libraries](https://github.com/triton-inference-server/client) that are +optimized for Triton. Using third-party benchmark suites like `jmeter` fails to +take advantage of the optimized libraries. Some of these optimizations includes +but are not limited to: + +1. Using + [binary tensor data extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md#binary-tensor-data-extension) + with HTTP requests. +2. Effective re-use of gRPC message allocation in subsequent requests. +3. Avoiding extra memory copy via libcurl interface. + +These optimizations can have a tremendous impact on overall performance. Using +Perf Analyzer for benchmarking directly allows a user to access these +optimizations in their study. + +Not only that, Perf Analyzer is also very customizable and supports many Triton +features as described in this document. This, along with a detailed report, +allows a user to identify performance bottlenecks and experiment with different +features before deciding upon what works best for them. diff --git a/src/c++/perf_analyzer/docs/cli.md b/src/c++/perf_analyzer/docs/cli.md new file mode 100644 index 000000000..19844514e --- /dev/null +++ b/src/c++/perf_analyzer/docs/cli.md @@ -0,0 +1,601 @@ + + +# Perf Analyzer CLI + +This document details the Perf Analyzer command line interface: + +- [General Options](#general-options) +- [Measurement Options](#measurement-options) +- [Sequence Model Options](#sequence-model-options) +- [Input Data Options](#input-data-options) +- [Request Options](#request-options) +- [Server Options](#server-options) +- [Prometheus Metrics Options](#prometheus-metrics-options) +- [Report Options](#report-options) +- [Trace Options](#trace-options) +- [Deprecated Options](#deprecated-options) + +## General Options + +#### `-?` +#### `-h` +#### `--help` + +Prints a description of the Perf Analyzer command line interface. + +#### `-m ` + +Specifies the model name for Perf Analyzer to run. + +This is a required option. + +#### `-x ` + +Specifies the version of the model to be used. If not specified the most +recent version (the highest numbered version) of the model will be used. + +#### `--service-kind=[triton|triton_c_api|tfserving|torchserve]` + +Specifies the kind of service for Perf Analyzer to generate load for. Note: in +order to use `torchserve` backend, the `--input-data` option must point to a +JSON file holding data in the following format: + +``` +{ + "data": [ + { + "TORCHSERVE_INPUT": [ + "" + ] + }, + {...}, + ... + ] +} +``` + +The type of file here will depend on the model. 
In order to use `triton_c_api` +you must specify the Triton server install path and the model repository path +via the `--triton-server-directory` and `--model-repository` options. + +Default is `triton`. + +#### `--bls-composing-models=` + +Specifies the list of all BLS composing models as a comma separated list of +model names (with optional model version number after a colon for each) that may +be called by the input BLS model. For example, +`--bls-composing-models=modelA:3,modelB` would specify that modelA and modelB +are composing models that may be called by the input BLS model, and that modelA +will use version 3, while modelB's version is unspecified. + +#### `--model-signature-name=` + +Specifies the signature name of the saved model to use. + +Default is `serving_default`. This option will be ignored if `--service-kind` +is not `tfserving`. + +#### `-v` + +Enables verbose mode. May be specified an additional time (`-v -v`) to enable +extra verbose mode. + +## Measurement Options + +#### `--measurement-mode=[time_windows|count_windows]` + +Specifies the mode used for stabilizing measurements. 'time_windows' will +create windows such that the duration of each window is equal to +`--measurement-interval`. 'count_windows' will create windows such that there +are at least `--measurement-request-count` requests in each window and that +the window is at least one second in duration (adding more requests if +necessary). + +Default is `time_windows`. + +#### `-p ` +#### `--measurement-interval=` + +Specifies the time interval used for each measurement in milliseconds when +`--measurement-mode=time_windows` is used. Perf Analyzer will sample a time +interval specified by this option and take measurement over the requests +completed within that time interval. + +Default is `5000`. + +#### `--measurement-request-count=` + +Specifies the minimum number of requests to be collected in each measurement +window when `--measurement-mode=count_windows` is used. + +Default is `50`. + +#### `-s ` +#### `--stability-percentage=` + +Specifies the allowed variation in latency measurements when determining if a +result is stable. The measurement is considered stable if the ratio of max / +min from the recent 3 measurements is within (stability percentage)% in terms +of both inferences per second and latency. + +Default is `10`(%). + +#### `--percentile=` + +Specifies the confidence value as a percentile that will be used to determine +if a measurement is stable. For example, a value of `85` indicates that the +85th percentile latency will be used to determine stability. The percentile +will also be reported in the results. + +Default is `-1` indicating that the average latency is used to determine +stability. + +#### `-r ` +#### `--max-trials=` + +Specifies the maximum number of measurements when attempting to reach stability +of inferences per second and latency for each concurrency or request rate +during the search. Perf Analyzer will terminate if the measurement is still +unstable after the maximum number of trials. + +Default is `10`. + +#### `--concurrency-range=` + +Specifies the range of concurrency levels covered by Perf Analyzer. Perf +Analyzer will start from the concurrency level of 'start' and go until 'end' +with a stride of 'step'. + +Default of 'end' and 'step' are `1`. If 'end' is not specified then Perf +Analyzer will run for a single concurrency level determined by 'start'. 
If
+'end' is set as `0`, then the concurrency limit will be incremented by 'step'
+until the latency threshold is met. 'end' and `--latency-threshold` cannot
+both be `0`. 'end' cannot be `0` for sequence models while using asynchronous
+mode.
+
+#### `--request-rate-range=`
+
+Specifies the range of request rates for load generated by Perf Analyzer. This
+option can take floating-point values. The search along the request rate range
+is enabled only when using this option.
+
+If not specified, then Perf Analyzer will search along the concurrency range.
+Perf Analyzer will start from the request rate of 'start' and go until 'end'
+with a stride of 'step'. Default values of 'start', 'end' and 'step' are all
+`1.0`. If 'end' is not specified, then Perf Analyzer will run for a single
+request rate as determined by 'start'. If 'end' is set as `0.0`, then the
+request rate will be incremented by 'step' until the latency threshold is met.
+'end' and `--latency-threshold` cannot both be `0`.
+
+#### `--request-distribution=[constant|poisson]`
+
+Specifies the time interval distribution between dispatching inference requests
+to the server. Poisson distribution closely mimics the real-world work load on
+a server. This option is ignored if not using `--request-rate-range`.
+
+Default is `constant`.
+
+#### `-l `
+#### `--latency-threshold=`
+
+Specifies the limit on the observed latency, in milliseconds. Perf Analyzer
+will terminate the concurrency or request rate search once the measured latency
+exceeds this threshold.
+
+Default is `0` indicating that Perf Analyzer will run for the entire
+concurrency or request rate range.
+
+#### `--binary-search`
+
+Enables binary search on the specified search range (concurrency or request
+rate). This option requires 'start' and 'end' to be explicitly specified in
+the concurrency range or request rate range. When using this option, 'step' is
+more like the precision. When the 'step' is lower, there are more iterations
+along the search path to find suitable convergence.
+
+When `--binary-search` is not specified, linear search is used.
+
+#### `--request-intervals=`
+
+Specifies a path to a file containing time intervals in microseconds. Each time
+interval should be in a new line. Perf Analyzer will try to maintain time
+intervals between successive generated requests to be as close as possible in
+this file. This option can be used to apply a custom load to the server with a
+certain pattern of interest. Perf Analyzer will loop around the file if the
+duration of execution exceeds the amount of time specified by the intervals.
+This option cannot be used with `--request-rate-range` or
+`--concurrency-range`.
+
+#### `--max-threads=`
+
+Specifies the maximum number of threads that will be created for providing
+desired concurrency or request rate. However, when running in synchronous mode
+with `--concurrency-range` having explicit 'end' specification, this value will
+be ignored.
+
+Default is `4` if `--request-rate-range` is specified, otherwise default is
+`16`.
+
+## Sequence Model Options
+
+#### `--num-of-sequences=`
+
+Specifies the number of concurrent sequences for sequence models. This option
+is ignored when `--request-rate-range` is not specified.
+
+Default is `4`.
+
+#### `--sequence-length=`
+
+Specifies the base length of a sequence used for sequence models. A sequence
+with length X will be composed of X requests to be sent as the elements in the
+sequence.
The actual length of the sequence will be within +/- Y% of the base
+length, where Y defaults to 20% and is customizable via
+`--sequence-length-variation`. If sequence length is unspecified and input data
+is provided, the sequence length will be the number of inputs in the
+user-provided input data.
+
+Default is `20`.
+
+#### `--sequence-length-variation=`
+
+Specifies the percentage variation in length of sequences. This option is only
+valid when not using user-provided input data or when `--sequence-length` is
+specified while using user-provided input data.
+
+Default is `20`(%).
+
+#### `--sequence-id-range=`
+
+Specifies the range of sequence IDs used by Perf Analyzer. Perf Analyzer will
+start from the sequence ID of 'start' and go until 'end' (excluded). If 'end'
+is not specified then Perf Analyzer will generate new sequence IDs without
+bounds. If 'end' is specified and the concurrency setting may result in
+maintaining a number of sequences more than the range of available sequence
+IDs, Perf Analyzer will exit with an error due to possible sequence ID
+collisions.
+
+The default for 'start' is `1`, and 'end' is not specified (no bounds).
+
+## Input Data Options
+
+#### `--input-data=[zero|random|]`
+
+Specifies the type of data that will be used for input in inference requests.
+The available options are `zero`, `random`, and a path to a directory or a JSON
+file.
+
+When pointing to a JSON file, the user must adhere to the format described in
+the [input data documentation](input_data.md). By specifying JSON data, users
+can control data used with every request. Multiple data streams can be specified
+for a sequence model, and Perf Analyzer will select a data stream in a
+round-robin fashion for every new sequence. Multiple JSON files can also be
+provided (`--input-data json_file1.json --input-data json_file2.json` and so on)
+and Perf Analyzer will append data streams from each file. When using
+`--service-kind=torchserve`, make sure this option points to a JSON file.
+
+If the option is a path to a directory then the directory must contain a binary
+file for each non-string input and a text file for each string input, named the
+same as the input. Each file must contain the data required for that input for
+a batch-1 request. Each binary file should contain the raw binary representation
+of the input in row-major order for non-string inputs. The text file should
+contain all strings needed by batch-1, each in a new line, listed in row-major
+order.
+
+Default is `random`.
+
+#### `-b `
+
+Specifies the batch size for each request sent.
+
+Default is `1`.
+
+#### `--shape=`
+
+Specifies the shape used for the specified input. The argument must be
+specified as 'name:shape' where the shape is a comma-separated list for
+dimension sizes. For example `--shape=input_name:1,2,3` indicates that the
+input `input_name` has tensor shape [ 1, 2, 3 ]. `--shape` may be specified
+multiple times to specify shapes for different inputs.
+
+#### `--string-data=`
+
+Specifies the string to initialize string input buffers. Perf Analyzer will
+replicate the given string to build tensors of required shape.
+`--string-length` will not have any effect. This option is ignored if
+`--input-data` points to a JSON file or directory.
+
+#### `--string-length=`
+
+Specifies the length of the random strings to be generated by Perf Analyzer
+for string input. This option is ignored if `--input-data` points to a
+JSON file or directory.
+
+Default is `128`.
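To illustrate how these input data options combine, the sketch below (the model and tensor names are hypothetical) sends batch-4 requests while fixing the shape of a variable-sized input:

```bash
# batch size 4, explicit shape for the variable-sized input tensor "IMAGE"
perf_analyzer -m my_model -b 4 --shape IMAGE:3,224,224
```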
+ +#### `--shared-memory=[none|system|cuda]` + +Specifies the type of the shared memory to use for input and output data. + +Default is `none`. + +#### `--output-shared-memory-size=` + +Specifies The size, in bytes, of the shared memory region to allocate per +output tensor. Only needed when one or more of the outputs are of string type +and/or variable shape. The value should be larger than the size of the largest +output tensor that the model is expected to return. Perf Analyzer will use the +following formula to calculate the total shared memory to allocate: +output_shared_memory_size * number_of_outputs * batch_size. + +Default is `102400` (100 KB). + +## Request Options + +#### `-i [http|grpc]` + +Specifies the communication protocol to use. The available protocols are gRPC +and HTTP. + +Default is `http`. + +#### `-a` +#### `--async` + +Enables asynchronous mode in Perf Analyzer. + +By default, Perf Analyzer will use a synchronous request API for inference. +However, if the model is sequential, then the default mode is asynchronous. +Specify `--sync` to operate sequential models in synchronous mode. In +synchronous mode, Perf Analyzer will start threads equal to the concurrency +level. Use asynchronous mode to limit the number of threads, yet maintain the +concurrency. + +#### `--sync` + +Enables synchronous mode in Perf Analyzer. Can be used to operate Perf +Analyzer with sequential model in synchronous mode. + +#### `--streaming` + +Enables the use of streaming API. This option is only valid with gRPC protocol. + +Default is `false`. + +#### `-H ` + +Specifies the header that will be added to HTTP requests (ignored for gRPC +requests). The header must be specified as 'Header:Value'. `-H` may be +specified multiple times to add multiple headers. + +#### `--grpc-compression-algorithm=[none|gzip|deflate]` + +Specifies the compression algorithm to be used by gRPC when sending requests. +Only supported when gRPC protocol is being used. + +Default is `none`. + +## Server Options + +#### `-u ` + +Specifies the URL for the server. + +Default is `localhost:8000` when using `--service-kind=triton` with HTTP. +Default is `localhost:8001` when using `--service-kind=triton` with gRPC. +Default is `localhost:8500` when using `--service-kind=tfserving`. + +#### `--ssl-grpc-use-ssl` + +Enables usage of an encrypted channel to the server. + +#### `--ssl-grpc-root-certifications-file=` + +Specifies the path to file containing the PEM encoding of the server root +certificates. + +#### `--ssl-grpc-private-key-file=` + +Specifies the path to file containing the PEM encoding of the client's private +key. + +#### `--ssl-grpc-certificate-chain-file=` + +Specifies the path to file containing the PEM encoding of the client's +certificate chain. + +#### `--ssl-https-verify-peer=[0|1]` + +Specifies whether to verify the peer's SSL certificate. See +https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html for the meaning of each +value. + +Default is `1`. + +#### `--ssl-https-verify-host=[0|1|2]` + +Specifies whether to verify the certificate's name against host. See +https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html for the meaning of each +value. + +Default is `2`. + +#### `--ssl-https-ca-certificates-file=` + +Specifies the path to Certificate Authority (CA) bundle. + +#### `--ssl-https-client-certificate-file=` + +Specifies the path to the SSL client certificate. + +#### `--ssl-https-client-certificate-type=[PEM|DER]` + +Specifies the type of the client SSL certificate. + +Default is `PEM`. 
+ +#### `--ssl-https-private-key-file=` + +Specifies the path to the private keyfile for TLS and SSL client cert. + +#### `--ssl-https-private-key-type=[PEM|DER]` + +Specifies the type of the private key file. + +Default is `PEM`. + +#### `--triton-server-directory=` + +Specifies the Triton server install path. Required by and only used when C API +is used (`--service-kind=triton_c_api`). + +Default is `/opt/tritonserver`. + +#### `--model-repository=` + +Specifies the model repository directory path for loading models. Required by +and only used when C API is used (`--service-kind=triton_c_api`). + +## Prometheus Metrics Options + +#### `--collect-metrics` + +Enables the collection of server-side inference server metrics. Perf Analyzer +will output metrics in the CSV file generated with the `-f` option. Only valid +when `--verbose-csv` option also used. + +#### `--metrics-url=` + +Specifies the URL to query for server-side inference server metrics. + +Default is `localhost:8002/metrics`. + +#### `--metrics-interval=` + +Specifies how often within each measurement window, in milliseconds, Perf +Analyzer should query for server-side inference server metrics. + +Default is `1000`. + +## Report Options + +#### `-f ` + +Specifies the path that the latency report file will be generated at. + +When `-f` is not specified, a latency report will not be generated. + +#### `--verbose-csv` + +Enables additional information being output to the CSV file generated by Perf +Analyzer. + +## Trace Options + +#### `--trace-file=` + +Specifies the file where trace output will be saved. + +If `--trace-log-frequency` is also specified, this argument value will be the +prefix of the files to save the trace output. See `--trace-log-frequency` for +details. Only used for `--service-kind=triton`. + +#### `--trace-level=[OFF|TIMESTAMPS|TENSORS]` + +Specifies a trace level. `OFF` disables tracing. `TIMESTAMPS` traces +timestamps. `TENSORS` traces tensors. It may be specified multiple times to +trace multiple informations. + +Default is `OFF`. + +#### `--trace-rate=` + +Specifies the trace sampling rate (traces per second). + +Default is `1000`. + +#### `--trace-count=` + +Specifies the number of traces to be sampled. If the value is `-1`, the number +of traces to be sampled will not be limited. + +Default is `-1`. + +#### `--log-frequency=` + +Specifies the trace log frequency. If the value is `0`, Triton will only log +the trace output to path specified via `--trace-file` when shutting down. +Otherwise, Triton will log the trace output to the path specified via +`--trace-file`. when it collects the specified number of traces. For +example, if `--trace-file` is specified to be `trace_file.log`, and if the log +frequency is `100`, when Triton collects the 100th trace, it logs the traces +to file `trace_file.log.0`, and when it collects the 200th trace, it logs the +101st to the 200th traces to file `trace_file.log.1`. + +Default is `0`. + +## Deprecated Options + +#### `--data-directory=` + +**DEPRECATED** + +Alias for `--input-data=` where `` is the path to a directory. See +`--input-data` option documentation for details. + +#### `-c ` + +**DEPRECATED** + +Specifies the maximum concurrency that Perf Analyzer will search up to. Cannot +be used with `--concurrency-range`. + +#### `-d` + +**DEPRECATED** + +Enables dynamic concurrency mode. Perf Analyzer will search along +concurrencies up to the maximum concurrency specified via `-c `. Cannot be +used with `--concurrency-range`. 
#### `-t <n>`

**DEPRECATED**

Specifies the number of concurrent requests. Cannot be used with `--concurrency-range`.

Default is `1`.

#### `-z`

**DEPRECATED**

Alias for `--input-data=zero`. See `--input-data` option documentation for details.

diff --git a/src/c++/perf_analyzer/docs/inference_load_modes.md b/src/c++/perf_analyzer/docs/inference_load_modes.md
new file mode 100644
index 000000000..8b119cea6
--- /dev/null
+++ b/src/c++/perf_analyzer/docs/inference_load_modes.md
@@ -0,0 +1,66 @@

# Inference Load Modes

Perf Analyzer has several modes for generating inference request load for a model.

## Concurrency Mode

In concurrency mode, Perf Analyzer attempts to send inference requests to the server such that N requests are always outstanding during profiling. For example, when using [`--concurrency-range=4`](cli.md#--concurrency-rangestartendstep), Perf Analyzer will attempt to have 4 outgoing inference requests at all times during profiling.

## Request Rate Mode

In request rate mode, Perf Analyzer attempts to send N inference requests per second to the server during profiling. For example, when using [`--request-rate-range=20`](cli.md#--request-rate-rangestartendstep), Perf Analyzer will attempt to send 20 requests per second during profiling.

## Custom Interval Mode

In custom interval mode, Perf Analyzer attempts to send inference requests according to intervals (between requests, looping if necessary) provided by the user in the form of a text file with one time interval (in microseconds) per line. For example, when using [`--request-intervals=my_intervals.txt`](cli.md#--request-intervalspath), where `my_intervals.txt` contains:

```
100000
200000
500000
```

Perf Analyzer will attempt to send requests at the following times: 0.1s, 0.3s, 0.8s, 0.9s, 1.1s, 1.6s, and so on, during profiling.

diff --git a/src/c++/perf_analyzer/docs/input_data.md b/src/c++/perf_analyzer/docs/input_data.md
new file mode 100644
index 000000000..83a305c10
--- /dev/null
+++ b/src/c++/perf_analyzer/docs/input_data.md
@@ -0,0 +1,305 @@

# Input Data

Use the [`--help`](cli.md#--help) option to see complete documentation for all input data options. By default Perf Analyzer sends random data to all the inputs of your model. You can select a different input data mode with the [`--input-data`](cli.md#--input-datazerorandompath) option:

- _random_: (default) Send random data for each input. Note: Perf Analyzer only generates random data once per input and reuses that for all inferences.
- _zero_: Send zeros for each input.
- directory path: A path to a directory containing a binary file for each input, named the same as the input. Each binary file must contain the data required for that input for a batch-1 request. Each file should contain the raw binary representation of the input in row-major order.
- file path: A path to a JSON file containing data to be used with every inference request. See the "Real Input Data" section for further details. [`--input-data`](cli.md#--input-datazerorandompath) can be provided multiple times with different file paths to specify multiple JSON files.

For tensors with `STRING`/`BYTES` datatype, the [`--string-length`](cli.md#--string-lengthn) and [`--string-data`](cli.md#--string-datastring) options may be used in some cases (see [`--help`](cli.md#--help) for full documentation).
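For example, a minimal sketch of how these string options might be used with a hypothetical model that has a `STRING`/`BYTES` input:

```bash
# generate random strings of length 16 for each string input element
perf_analyzer -m my_text_model --string-length 16

# or replicate a fixed string for every string input element instead
perf_analyzer -m my_text_model --string-data "hello world"
```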
+ +For models that support batching you can use the [`-b`](cli.md#-b-n) option to +indicate the batch size of the requests that Perf Analyzer should send. For +models with variable-sized inputs you must provide the +[`--shape`](cli.md#--shapestring) argument so that Perf Analyzer knows what +shape tensors to use. For example, for a model that has an input called +`IMAGE` that has shape `[3, N, M]`, where `N` and `M` are variable-size +dimensions, to tell Perf Analyzer to send batch size 4 requests of shape +`[3, 224, 224]`: + +``` +$ perf_analyzer -m mymodel -b 4 --shape IMAGE:3,224,224 +``` + +## Real Input Data + +The performance of some models is highly dependent on the data used. For such +cases you can provide data to be used with every inference request made by Perf +Analyzer in a JSON file. Perf Analyzer will use the provided data in a +round-robin order when sending inference requests. For sequence models, if a +sequence length is specified via +[`--sequence-length`](cli.md#--sequence-lengthn), Perf Analyzer will also loop +through the provided data in a round-robin order up to the specified sequence +length (with a percentage variation customizable via +[`--sequence-length-variation`](cli.md#--sequence-length-variationn)). +Otherwise, the sequence length will be the number of inputs specified in +user-provided input data. + +Each entry in the `"data"` array must specify all input tensors with the exact +size expected by the model for a single batch. The following example describes +data for a model with inputs named, `INPUT0` and `INPUT1`, shape `[4, 4]` and +data type `INT32`: + +```json +{ + "data": + [ + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + }, + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + }, + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + }, + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + } + ] +} +``` + +Note that the `[4, 4]` tensor has been flattened in a row-major format for the +inputs. In addition to specifying explicit tensors, you can also provide Base64 +encoded binary data for the tensors. Each data object must list its data in a +row-major order. Binary data must be in little-endian byte order. The following +example highlights how this can be acheived: + +```json +{ + "data": + [ + { + "INPUT0": {"b64": "YmFzZTY0IGRlY29kZXI="}, + "INPUT1": {"b64": "YmFzZTY0IGRlY29kZXI="} + }, + { + "INPUT0": {"b64": "YmFzZTY0IGRlY29kZXI="}, + "INPUT1": {"b64": "YmFzZTY0IGRlY29kZXI="} + }, + { + "INPUT0": {"b64": "YmFzZTY0IGRlY29kZXI="}, + "INPUT1": {"b64": "YmFzZTY0IGRlY29kZXI="} + } + ] +} +``` + +In case of sequence models, multiple data streams can be specified in the JSON +file. Each sequence will get a data stream of its own and Perf Analyzer will +ensure the data from each stream is played back to the same correlation ID. 
The +below example highlights how to specify data for multiple streams for a sequence +model with a single input named `INPUT`, shape `[1]` and data type `STRING`: + +```json +{ + "data": + [ + [ + { + "INPUT": ["1"] + }, + { + "INPUT": ["2"] + }, + { + "INPUT": ["3"] + }, + { + "INPUT": ["4"] + } + ], + [ + { + "INPUT": ["1"] + }, + { + "INPUT": ["1"] + }, + { + "INPUT": ["1"] + } + ], + [ + { + "INPUT": ["1"] + }, + { + "INPUT": ["1"] + } + ] + ] +} +``` + +The above example describes three data streams with lengths 4, 3 and 2 +respectively. Perf Analyzer will hence produce sequences of length 4, 3 and 2 in +this case. + +You can also provide an optional `"shape"` field to the tensors. This is +especially useful while profiling the models with variable-sized tensors as +input. Additionally note that when providing the `"shape"` field, tensor +contents must be provided separately in a "content" field in row-major order. +The specified shape values will override default input shapes provided as a +command line option (see [`--shape`](cli.md#--shapestring)) for variable-sized +inputs. In the absence of a `"shape"` field, the provided defaults will be used. +There is no need to specify shape as a command line option if all the input data +provide shape values for variable tensors. Below is an example JSON file for a +model with a single input `INPUT`, shape `[-1, -1]` and data type `INT32`: + +```json +{ + "data": + [ + { + "INPUT": + { + "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "shape": [2,8] + } + }, + { + "INPUT": + { + "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "shape": [8,2] + } + }, + { + "INPUT": + { + "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + } + }, + { + "INPUT": + { + "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "shape": [4,4] + } + } + ] +} +``` + +The following is the example to provide contents as base64 string with explicit +shapes: + +```json +{ + "data": + [ + { + "INPUT": + { + "content": {"b64": "/9j/4AAQSkZ(...)"}, + "shape": [7964] + } + }, + { + "INPUT": + { + "content": {"b64": "/9j/4AAQSkZ(...)"}, + "shape": [7964] + } + } + ] +} +``` + +Note that for `STRING` type, an element is represented by a 4-byte unsigned +integer giving the length followed by the actual bytes. The byte array to be +encoded using base64 must include the 4-byte unsigned integers. + +### Output Validation + +When real input data is provided, it is optional to request Perf Analyzer to +validate the inference output for the input data. + +Validation output can be specified in the `"validation_data"` field have the +same format as the `"data"` field for real input. Note that the entries in +`"validation_data"` must align with `"data"` for proper mapping. The following +example describes validation data for a model with inputs named `INPUT0` and +`INPUT1`, outputs named `OUTPUT0` and `OUTPUT1`, all tensors have shape `[4, 4]` +and data type `INT32`: + +```json +{ + "data": + [ + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + } + ], + "validation_data": + [ + { + "OUTPUT0": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + "OUTPUT1": [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] + } + ] +} +``` + +Besides the above example, the validation outputs can be specified in the same +variations described in the real input data section. 
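As a sketch, assuming the JSON above were saved to a hypothetical file named `data_with_validation.json`, it is supplied like any other real input data file, and Perf Analyzer will compare the received outputs against the listed validation values:

```bash
# use the user-provided data (and its validation outputs) for every request
perf_analyzer -m my_model --input-data data_with_validation.json
```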
+ +# Shared Memory + +By default Perf Analyzer sends input tensor data and receives output tensor data +over the network. You can instead instruct Perf Analyzer to use system shared +memory or CUDA shared memory to communicate tensor data. By using these options +you can model the performance that you can achieve by using shared memory in +your application. Use +[`--shared-memory=system`](cli.md#--shared-memorynonesystemcuda) to use system +(CPU) shared memory or +[`--shared-memory=cuda`](cli.md#--shared-memorynonesystemcuda) to use CUDA +shared memory. diff --git a/src/c++/perf_analyzer/docs/install.md b/src/c++/perf_analyzer/docs/install.md new file mode 100644 index 000000000..b5d84a62a --- /dev/null +++ b/src/c++/perf_analyzer/docs/install.md @@ -0,0 +1,106 @@ + + +# Recommended Installation Method + +## Triton SDK Container + +The recommended way to "install" Perf Analyzer is to run the pre-built +executable from within the Triton SDK docker container available on the +[NVIDIA GPU Cloud Catalog](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver). +As long as the SDK container has its network exposed to the address and port of +the inference server, Perf Analyzer will be able to run. + +```bash +export RELEASE= # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02` + +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +# inside container +perf_analyzer -m +``` + +# Alternative Installation Methods + +- [Pip](#pip) +- [Build from Source](#build-from-source) + +## Pip + +```bash +pip install tritonclient + +perf_analyzer -m +``` + +**Warning**: If any runtime dependencies are missing, Perf Analyzer will produce +errors showing which ones are missing. You will need to manually install them. + +## Build from Source + +The Triton SDK container is used for building, so some build and runtime +dependencies are already installed. + +```bash +export RELEASE= # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02` + +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +# inside container +# prep installing newer version of cmake +wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null ; apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' + +# install build/runtime dependencies +apt update ; apt install -y cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1 libcurl4-openssl-dev rapidjson-dev + +rm -rf client ; git clone --depth 1 https://github.com/triton-inference-server/client + +mkdir client/build ; cd client/build + +cmake -DTRITON_ENABLE_PERF_ANALYZER=ON .. + +make -j8 cc-clients + +perf_analyzer -m +``` + +- To enable + [CUDA shared memory](input_data.md#shared-memory), add + `-DTRITON_ENABLE_GPU=ON` to the `cmake` command. +- To enable + [C API mode](benchmarking.md#benchmarking-triton-directly-via-c-api), add + `-DTRITON_ENABLE_PERF_ANALYZER_C_API=ON` to the `cmake` command. +- To enable [TorchServe backend](benchmarking.md#benchmarking-torchserve), add + `-DTRITON_ENABLE_PERF_ANALYZER_TS=ON` to the `cmake` command. +- To enable + [Tensorflow Serving backend](benchmarking.md#benchmarking-tensorflow-serving), + add `-DTRITON_ENABLE_PERF_ANALYZER_TFS=ON` to the `cmake` command. 
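For example, combining the flags listed above, a build that also enables CUDA shared memory and C API mode could be configured as follows (a sketch; run in place of the plain `cmake` invocation above, from the same `client/build` directory):

```bash
# enable optional features on top of the base Perf Analyzer build
cmake -DTRITON_ENABLE_PERF_ANALYZER=ON \
      -DTRITON_ENABLE_PERF_ANALYZER_C_API=ON \
      -DTRITON_ENABLE_GPU=ON ..

make -j8 cc-clients
```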
diff --git a/src/c++/perf_analyzer/docs/measurements_metrics.md b/src/c++/perf_analyzer/docs/measurements_metrics.md new file mode 100644 index 000000000..dd6b1ee72 --- /dev/null +++ b/src/c++/perf_analyzer/docs/measurements_metrics.md @@ -0,0 +1,224 @@ + + +# Measurement Modes + +Currently, Perf Analyzer has 2 measurement modes. + +## Time Windows + +When using time windows measurement mode +([`--measurement-mode=time_windows`](cli.md#--measurement-modetime_windowscount_windows)), +Perf Analyzer will count how many requests have completed during a window of +duration `X` (in milliseconds, via `--measurement-interval=X`, default is +`5000`). This is the default measurement mode. + +## Count Windows + +When using count windows measurement mode +([`--measurement-mode=count_windows`](cli.md#--measurement-modetime_windowscount_windows)), +Perf Analyzer will start the window duration at 1 second and potentially +dynamically increase it until `X` requests have completed (via +[`--measurement-request-count=X`](cli.md#--measurement-request-countn), default +is `50`). + +# Metrics + +## How Throughput is Calculated + +Perf Analyzer calculates throughput to be the total number of requests completed +during a measurement, divided by the duration of the measurement, in seconds. + +## How Latency is Calculated + +For each request concurrency level Perf Analyzer reports latency and throughput +as seen from Perf Analyzer and also the average request latency on the server. + +The server latency measures the total time from when the request is received at +the server until when the response is sent from the server. Because of the HTTP +and gRPC libraries used to implement the server endpoints, total server latency +is typically more accurate for HTTP requests as it measures time from the first +byte received until last byte sent. For both HTTP and gRPC the total server +latency is broken-down into the following components: + +- _queue_: The average time spent in the inference schedule queue by a request + waiting for an instance of the model to become available. +- _compute_: The average time spent performing the actual inference, including + any time needed to copy data to/from the GPU. +- _overhead_: The average time spent in the endpoint that cannot be correctly + captured in the send/receive time with the way the gRPC and HTTP libraries are + structured. + +The client latency time is broken-down further for HTTP and gRPC as follows: + +- HTTP: _send/recv_ indicates the time on the client spent sending the request + and receiving the response. _response wait_ indicates time waiting for the + response from the server. +- gRPC: _(un)marshal request/response_ indicates the time spent marshalling the + request data into the gRPC protobuf and unmarshalling the response data from + the gRPC protobuf. _response wait_ indicates time writing the gRPC request to + the network, waiting for the response, and reading the gRPC response from the + network. + +Use the verbose ([`-v`](cli.md#-v)) option see more output, including the +stabilization passes run for each request concurrency level or request rate. + +# Reports + +## Visualizing Latency vs. Throughput + +Perf Analyzer provides the [`-f`](cli.md#-f-path) option to generate a file +containing CSV output of the results. + +``` +$ perf_analyzer -m inception_graphdef --concurrency-range 1:4 -f perf.csv +... 
+$ cat perf.csv +Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency +1,69.2,225,2148,64,206,11781,19,0,13891,18795,19753,21018 +3,84.2,237,1768,21673,209,11742,17,0,35398,43984,47085,51701 +4,84.2,279,1604,33669,233,11731,18,1,47045,56545,59225,64886 +2,87.2,235,1973,9151,190,11346,17,0,21874,28557,29768,34766 +``` + +NOTE: The rows in the CSV file are sorted in an increasing order of throughput +(Inferences/Second). + +You can import the CSV file into a spreadsheet to help visualize the latency vs +inferences/second tradeoff as well as see some components of the latency. Follow +these steps: + +- Open + [this spreadsheet](https://docs.google.com/spreadsheets/d/1S8h0bWBBElHUoLd2SOvQPzZzRiQ55xjyqodm_9ireiw) +- Make a copy from the File menu "Make a copy..." +- Open the copy +- Select the A1 cell on the "Raw Data" tab +- From the File menu select "Import..." +- Select "Upload" and upload the file +- Select "Replace data at selected cell" and then select the "Import data" + button + +## Server-side Prometheus metrics + +Perf Analyzer can collect +[server-side metrics](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md#gpu-metrics), +such as GPU utilization and GPU power usage. To enable the collection of these +metrics, use the [`--collect-metrics`](cli.md#--collect-metrics) option. + +By default, Perf Analyzer queries the metrics endpoint at the URL +`localhost:8002/metrics`. If the metrics are accessible at a different url, use +the [`--metrics-url=`](cli.md#--metrics-urlurl) option to specify that. + +By default, Perf Analyzer queries the metrics endpoint every 1000 milliseconds. +To use a different querying interval, use the +[`--metrics-interval=`](cli.md#--metrics-intervaln) option (specify in +milliseconds). + +Because Perf Analyzer can collect the server-side metrics multiple times per +run, these metrics are aggregated in specific ways to produce one final number +per searched concurrency or request rate. Here are how the metrics are +aggregated: + +| Metric | Aggregation | +| - | - | +| GPU Utilization | Averaged from each collection taken during stable passes. We want a number representative of all stable passes. | +| GPU Power Usage | Averaged from each collection taken during stable passes. We want a number representative of all stable passes. | +| GPU Used Memory | Maximum from all collections taken during a stable pass. Users are typically curious what the peak memory usage is for determining model/hardware viability. | +| GPU Total Memory | First from any collection taken during a stable pass. All of the collections should produce the same value for total memory available on the GPU. | + +Note that all metrics are per-GPU in the case of multi-GPU systems. + +To output these server-side metrics to a CSV file, use the +[`-f `](cli.md#-f-path) and [`--verbose-csv`](cli.md#--verbose-csv) +options. The output CSV will contain one column per metric. The value of each +column will be a `key:value` pair (`GPU UUID:metric value`). Each `key:value` +pair will be delimited by a semicolon (`;`) to indicate metric values for each +GPU accessible by the server. There is a trailing semicolon. 
See below:

`<gpu uuid>:<metric value>;<gpu uuid>:<metric value>;...;`

Here is a simplified CSV output:

```
$ perf_analyzer -m resnet50_libtorch --collect-metrics -f output.csv --verbose-csv
$ cat output.csv
Concurrency,...,Avg GPU Utilization,Avg GPU Power Usage,Max GPU Memory Usage,Total GPU Memory
1,...,gpu_uuid_0:0.33;gpu_uuid_1:0.5;,gpu_uuid_0:55.3;gpu_uuid_1:56.9;,gpu_uuid_0:10000;gpu_uuid_1:11000;,gpu_uuid_0:50000;gpu_uuid_1:75000;,
2,...,gpu_uuid_0:0.25;gpu_uuid_1:0.6;,gpu_uuid_0:25.6;gpu_uuid_1:77.2;,gpu_uuid_0:11000;gpu_uuid_1:17000;,gpu_uuid_0:50000;gpu_uuid_1:75000;,
3,...,gpu_uuid_0:0.87;gpu_uuid_1:0.9;,gpu_uuid_0:87.1;gpu_uuid_1:71.7;,gpu_uuid_0:15000;gpu_uuid_1:22000;,gpu_uuid_0:50000;gpu_uuid_1:75000;,
```

## Communication Protocol

By default, Perf Analyzer uses HTTP to communicate with Triton. The gRPC protocol can be specified with the [`-i [http|grpc]`](cli.md#-i-httpgrpc) option. If gRPC is selected, the [`--streaming`](cli.md#--streaming) option can also be specified for gRPC streaming.

### SSL/TLS Support

Perf Analyzer can be used to benchmark a Triton service behind SSL/TLS-enabled endpoints. The following options help establish a secure connection with the endpoint and profile the server.

For gRPC, see the following options:

- [`--ssl-grpc-use-ssl`](cli.md#--ssl-grpc-use-ssl)
- [`--ssl-grpc-root-certifications-file=<path>`](cli.md#--ssl-grpc-root-certifications-filepath)
- [`--ssl-grpc-private-key-file=<path>`](cli.md#--ssl-grpc-private-key-filepath)
- [`--ssl-grpc-certificate-chain-file=<path>`](cli.md#--ssl-grpc-certificate-chain-filepath)

More details here:
https://grpc.github.io/grpc/cpp/structgrpc_1_1_ssl_credentials_options.html

The [inference protocol gRPC SSL/TLS section](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md#ssltls) describes server-side options to configure SSL/TLS in Triton's gRPC endpoint.

For HTTPS, the following options are exposed:

- [`--ssl-https-verify-peer`](cli.md#--ssl-https-verify-peer01)
- [`--ssl-https-verify-host`](cli.md#--ssl-https-verify-host012)
- [`--ssl-https-ca-certificates-file`](cli.md#--ssl-https-ca-certificates-filepath)
- [`--ssl-https-client-certificate-file`](cli.md#--ssl-https-client-certificate-filepath)
- [`--ssl-https-client-certificate-type`](cli.md#--ssl-https-client-certificate-typepemder)
- [`--ssl-https-private-key-file`](cli.md#--ssl-https-private-key-filepath)
- [`--ssl-https-private-key-type`](cli.md#--ssl-https-private-key-typepemder)

See [`--help`](cli.md#--help) for full documentation.

Unlike gRPC, Triton's HTTP server endpoint cannot be configured with SSL/TLS support.

Note: Just providing these `--ssl-https-*` options to Perf Analyzer does not ensure that SSL/TLS is used in communication. If SSL/TLS is not enabled on the service endpoint, these options have no effect. The intent of exposing these options is to allow users to configure Perf Analyzer to benchmark a Triton service behind SSL/TLS-enabled endpoints. In other words, if Triton is running behind an HTTPS server proxy, then these options allow Perf Analyzer to profile Triton via the exposed HTTPS proxy.

diff --git a/src/c++/perf_analyzer/docs/quick_start.md b/src/c++/perf_analyzer/docs/quick_start.md
new file mode 100644
index 000000000..cfcc2b3d1
--- /dev/null
+++ b/src/c++/perf_analyzer/docs/quick_start.md
@@ -0,0 +1,114 @@

# Quick Start

The steps below will guide you on how to start using Perf Analyzer.
+ +### Step 1: Start Triton Container + +```bash +export RELEASE= # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02` + +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3 + +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3 +``` + +### Step 2: Download `simple` Model + +```bash +# inside triton container +git clone --depth 1 https://github.com/triton-inference-server/server + +mkdir model_repository ; cp -r server/docs/examples/model_repository/simple model_repository +``` + +### Step 3: Start Triton Server + +```bash +# inside triton container +tritonserver --model-repository $(pwd)/model_repository &> server.log & + +# confirm server is ready, look for 'HTTP/1.1 200 OK' +curl -v localhost:8000/v2/health/ready + +# detatch (CTRL-p CTRL-q) +``` + +### Step 4: Start Triton SDK Container + +```bash +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk +``` + +### Step 5: Run Perf Analyzer + +```bash +# inside sdk container +perf_analyzer -m simple +``` + +### Step 6: Observe and Analyze Output + +``` +$ perf_analyzer -m simple +*** Measurement Settings *** + Batch size: 1 + Service Kind: Triton + Using "time_windows" mode for stabilization + Measurement window: 5000 msec + Using synchronous calls for inference + Stabilizing using average latency + +Request concurrency: 1 + Client: + Request count: 25348 + Throughput: 1407.84 infer/sec + Avg latency: 708 usec (standard deviation 663 usec) + p50 latency: 690 usec + p90 latency: 881 usec + p95 latency: 926 usec + p99 latency: 1031 usec + Avg HTTP time: 700 usec (send/recv 102 usec + response wait 598 usec) + Server: + Inference count: 25348 + Execution count: 25348 + Successful request count: 25348 + Avg request latency: 382 usec (overhead 41 usec + queue 41 usec + compute input 26 usec + compute infer 257 usec + compute output 16 usec) + +Inferences/Second vs. Client Average Batch Latency +Concurrency: 1, throughput: 1407.84 infer/sec, latency 708 usec +``` + +We can see from the output that the model was able to complete approximately +1407.84 inferences per second, with an average latency of 708 microseconds per +inference request. Concurrency of 1 meant that Perf Analyzer attempted to always +have 1 outgoing request at all times.
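From here, a natural next experiment is to sweep the load instead of using the default concurrency of 1. For example, using the `--concurrency-range` option described in the CLI documentation:

```bash
# inside sdk container: measure the simple model at concurrencies 1 through 4
perf_analyzer -m simple --concurrency-range 1:4
```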