diff --git a/src/c++/perf_analyzer/README.md b/src/c++/perf_analyzer/README.md
index 4bc9b3472..35fbf4720 100644
--- a/src/c++/perf_analyzer/README.md
+++ b/src/c++/perf_analyzer/README.md
@@ -1,752 +1,170 @@
-# Performance Analyzer
-
-A critical part of optimizing the inference performance of your model
-is being able to measure changes in performance as you experiment with
-different optimization strategies. The perf_analyzer application
-(previously known as perf_client) performs this task for the Triton
-Inference Server. The perf_analyzer is included with the client
-examples which are [available from several
-sources](https://github.com/triton-inference-server/client#getting-the-client-libraries-and-examples).
-
-The perf_analyzer application generates inference requests to your
-model and measures the throughput and latency of those requests. To
-get representative results, perf_analyzer measures the throughput and
-latency over a time window, and then repeats the measurements until it
-gets stable values. By default perf_analyzer uses average latency to
-determine stability but you can use the --percentile flag to stabilize
-results based on that confidence level. For example, if
---percentile=95 is used the results will be stabilized using the 95-th
-percentile request latency. For example,
+# Triton Performance Analyzer
-```
-$ perf_analyzer -m inception_graphdef --percentile=95
-*** Measurement Settings ***
-  Batch size: 1
-  Measurement window: 5000 msec
-  Using synchronous calls for inference
-  Stabilizing using p95 latency
-
-Request concurrency: 1
-  Client:
-    Request count: 348
-    Throughput: 69.6 infer/sec
-    p50 latency: 13936 usec
-    p90 latency: 18682 usec
-    p95 latency: 19673 usec
-    p99 latency: 21859 usec
-    Avg HTTP time: 14017 usec (send/recv 200 usec + response wait 13817 usec)
-  Server:
-    Inference count: 428
-    Execution count: 428
-    Successful request count: 428
-    Avg request latency: 12005 usec (overhead 36 usec + queue 42 usec + compute input 164 usec + compute infer 11748 usec + compute output 15 usec)
-
-Inferences/Second vs. Client p95 Batch Latency
-Concurrency: 1, throughput: 69.6 infer/sec, latency 19673 usec
-```
-
-## Request Concurrency
-
-By default perf_analyzer measures your model's latency and throughput
-using the lowest possible load on the model. To do this perf_analyzer
-sends one inference request to Triton and waits for the response.
-When that response is received, the perf_analyzer immediately sends
-another request, and then repeats this process during the measurement
-windows. The number of outstanding inference requests is referred to
-as the *request concurrency*, and so by default perf_analyzer uses a
-request concurrency of 1.
+Triton Performance Analyzer is a CLI tool which can help you optimize the
+inference performance of models running on Triton Inference Server by measuring
+changes in performance as you experiment with different optimization strategies.
-Using the --concurrency-range \:\:\ option you can have
-perf_analyzer collect data for a range of request concurrency
-levels. Use the --help option to see complete documentation for this
-and other options. For example, to see the latency and throughput of
-your model for request concurrency values from 1 to 4:
+
-```
-$ perf_analyzer -m inception_graphdef --concurrency-range 1:4
-*** Measurement Settings ***
-  Batch size: 1
-  Measurement window: 5000 msec
-  Latency limit: 0 msec
-  Concurrency limit: 4 concurrent requests
-  Using synchronous calls for inference
-  Stabilizing using average latency
-
-Request concurrency: 1
-  Client:
-    Request count: 339
-    Throughput: 67.8 infer/sec
-    Avg latency: 14710 usec (standard deviation 2539 usec)
-    p50 latency: 13665 usec
-...
-Request concurrency: 4
-  Client:
-    Request count: 415
-    Throughput: 83 infer/sec
-    Avg latency: 48064 usec (standard deviation 6412 usec)
-    p50 latency: 47975 usec
-    p90 latency: 56670 usec
-    p95 latency: 59118 usec
-    p99 latency: 63609 usec
-    Avg HTTP time: 48166 usec (send/recv 264 usec + response wait 47902 usec)
-  Server:
-    Inference count: 498
-    Execution count: 498
-    Successful request count: 498
-    Avg request latency: 45602 usec (overhead 39 usec + queue 33577 usec + compute input 217 usec + compute infer 11753 usec + compute output 16 usec)
-
-Inferences/Second vs. Client Average Batch Latency
-Concurrency: 1, throughput: 67.8 infer/sec, latency 14710 usec
-Concurrency: 2, throughput: 89.8 infer/sec, latency 22280 usec
-Concurrency: 3, throughput: 80.4 infer/sec, latency 37283 usec
-Concurrency: 4, throughput: 83 infer/sec, latency 48064 usec
-```
+# Features
-## Understanding The Output
+### Inference Load Modes
-### How Throughput is Calculated
+- [Concurrency Mode](docs/inference_load_modes.md#concurrency-mode) simulates
+  load by maintaining a specific concurrency of outgoing requests to the
+  server
-Perf Analyzer calculates throughput to be the total number of requests completed
-during a measurement, divided by the duration of the measurement, in seconds.
+- [Request Rate Mode](docs/inference_load_modes.md#request-rate-mode) simulates
+  load by sending consecutive requests at a specific rate to the server
-### How Latency is Calculated
+- [Custom Interval Mode](docs/inference_load_modes.md#custom-interval-mode)
+  simulates load by sending consecutive requests at specific intervals to the
+  server
-For each request concurrency level perf_analyzer reports latency and
-throughput as seen from the *client* (that is, as seen by
-perf_analyzer) and also the average request latency on the server.
+### Performance Measurement Modes
-The server latency measures the total time from when the request is
-received at the server until the response is sent from the
-server. Because of the HTTP and GRPC libraries used to implement the
-server endpoints, total server latency is typically more accurate for
-HTTP requests as it measures time from first byte received until last
-byte sent. For both HTTP and GRPC the total server latency is
-broken-down into the following components:
+- [Time Windows Mode](docs/measurements_metrics.md#time-windows) measures model
+  performance repeatedly over a specific time interval until performance has
+  stabilized
-- *queue*: The average time spent in the inference schedule queue by a
-  request waiting for an instance of the model to become available.
-- *compute*: The average time spent performing the actual inference,
-  including any time needed to copy data to/from the GPU.
+- [Count Windows Mode](docs/measurements_metrics.md#count-windows) measures + model performance repeatedly over a specific number of requests until + performance has stabilized -The client latency time is broken-down further for HTTP and GRPC as -follows: +### Other Features -- HTTP: *send/recv* indicates the time on the client spent sending the - request and receiving the response. *response wait* indicates time - waiting for the response from the server. -- GRPC: *(un)marshal request/response* indicates the time spent - marshalling the request data into the GRPC protobuf and - unmarshalling the response data from the GRPC protobuf. *response - wait* indicates time writing the GRPC request to the network, - waiting for the response, and reading the GRPC response from the - network. +- [Sequence Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#stateful-models) + and + [Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) + can be profiled in addition to standard/stateless models -Use the verbose (-v) option to perf_analyzer to see more output, -including the stabilization passes run for each request concurrency -level. +- [Input Data](docs/input_data.md) to model inferences can be auto-generated or + specified as well as verifying output -## Measurement Modes +- [TensorFlow Serving](docs/benchmarking.md#benchmarking-tensorflow-serving) and + [TorchServe](docs/benchmarking.md#benchmarking-torchserve) can be used as the + inference server in addition to the default Triton server -### Time Windows +
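As a quick illustration, each inference load mode corresponds to a CLI flag described in the [Perf Analyzer CLI](docs/cli.md) reference; the model name and values below are placeholders, not recommendations:

```bash
# Concurrency mode: keep 1 to 4 requests outstanding at a time
perf_analyzer -m <model_name> --concurrency-range 1:4

# Request rate mode: send requests at a fixed rate (requests per second)
perf_analyzer -m <model_name> --request-rate-range 100

# Custom interval mode: read per-request intervals (in microseconds) from a file
perf_analyzer -m <model_name> --request-intervals /path/to/intervals.txt
```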
-When using time windows measurement mode (`--measurement-mode=time_windows`), -Perf Analyzer will count how many requests have completed during a window of -duration `X` (in milliseconds, via `--measurement-interval=X`, default is -`5000`). This is the default measurement mode. +# Quick Start -### Count Windows +The steps below will guide you on how to start using Perf Analyzer. -When using count windows measurement mode (`--measurement-mode=count_windows`), -Perf Analyzer will start the window duration at 1 second and potentially -dynamically increase it until `X` requests have completed (via -`--measurement-request-count=X`, default is `50`). +### Step 1: Start Triton Container -## Visualizing Latency vs. Throughput +```bash +export RELEASE= # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02` -The perf_analyzer provides the -f option to generate a file containing -CSV output of the results. +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3 +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3 ``` -$ perf_analyzer -m inception_graphdef --concurrency-range 1:4 -f perf.csv -$ cat perf.csv -Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency -1,69.2,225,2148,64,206,11781,19,0,13891,18795,19753,21018 -3,84.2,237,1768,21673,209,11742,17,0,35398,43984,47085,51701 -4,84.2,279,1604,33669,233,11731,18,1,47045,56545,59225,64886 -2,87.2,235,1973,9151,190,11346,17,0,21874,28557,29768,34766 -``` - -NOTE: The rows in the CSV file are sorted in an increasing order of throughput (Inferences/Second). - -You can import the CSV file into a spreadsheet to help visualize -the latency vs inferences/second tradeoff as well as see some -components of the latency. Follow these steps: - -- Open [this - spreadsheet](https://docs.google.com/spreadsheets/d/1S8h0bWBBElHUoLd2SOvQPzZzRiQ55xjyqodm_9ireiw) -- Make a copy from the File menu "Make a copy..." -- Open the copy -- Select the A1 cell on the "Raw Data" tab -- From the File menu select "Import..." -- Select "Upload" and upload the file -- Select "Replace data at selected cell" and then select the "Import data" button - -### Server-side Prometheus metrics - -Perf Analyzer can collect -[server-side metrics](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md#gpu-metrics) -, such as GPU utilization and GPU power usage. To enable the collection of these metrics, -use the `--collect-metrics` CLI option. - -Perf Analyzer defaults to access the metrics endpoint at -`localhost:8002/metrics`. If the metrics are accessible at a different url, use -the `--metrics-url ` CLI option to specify that. - -Perf Analyzer defaults to access the metrics endpoint every 1000 milliseconds. -To use a different accessing interval, use the `--metrics-interval ` -CLI option (specify in milliseconds). - -Because Perf Analyzer can collect the server-side metrics multiple times per -run, these metrics are aggregated in specific ways to produce one final number -per sweep (concurrency/request rate). Here are how they are aggregated: -| Metric | Aggregation | -|--------|-------------| -| GPU Utilization | Averaged from each collection taken during stable passes. We want a number representative of all stable passes. | -| GPU Power Usage | Averaged from each collection taken during stable passes. 
We want a number representative of all stable passes. | -| GPU Used Memory | Maximum from all collections taken during a stable pass. Users are typically curious what the peak memory usage is for determining model/hardware viability. | -| GPU Total Memory | First from any collection taken during a stable pass. All of the collections should produce the same value for total memory available on the GPU. | - -Note that all metrics are per-GPU in the case of multi-GPU systems. - -To output these server-side metrics to a CSV file, use the `-f ` and -`--verbose-csv` CLI options. The output CSV will contain one column per metric. -The value of each column will be a `key:value` pair (`GPU UUID:metric value`). -Each `key:value` pair will be delimited by a semicolon (`;`) to indicate metric -values for each GPU accessible by the server. There is a trailing semicolon. See -below: - -`:;:;...;` - -Here is a simplified CSV output: +### Step 2: Download `simple` Model ```bash -$ perf_analyzer -m resnet50_libtorch --collect-metrics -f output.csv --verbose-csv -$ cat output.csv -Concurrency,...,Avg GPU Utilization,Avg GPU Power Usage,Max GPU Memory Usage,Total GPU Memory -1,...,gpu_uuid_0:0.33;gpu_uuid_1:0.5;,gpu_uuid_0:55.3;gpu_uuid_1:56.9;,gpu_uuid_0:10000;gpu_uuid_1:11000;,gpu_uuid_0:50000;gpu_uuid_1:75000;, -2,...,gpu_uuid_0:0.25;gpu_uuid_1:0.6;,gpu_uuid_0:25.6;gpu_uuid_1:77.2;,gpu_uuid_0:11000;gpu_uuid_1:17000;,gpu_uuid_0:50000;gpu_uuid_1:75000;, -3,...,gpu_uuid_0:0.87;gpu_uuid_1:0.9;,gpu_uuid_0:87.1;gpu_uuid_1:71.7;,gpu_uuid_0:15000;gpu_uuid_1:22000;,gpu_uuid_0:50000;gpu_uuid_1:75000;, -``` - -## Input Data - -Use the --help option to see complete documentation for all input -data options. By default perf_analyzer sends random data to all the -inputs of your model. You can select a different input data mode with -the --input-data option: - -- *random*: (default) Send random data for each input. -- *zero*: Send zeros for each input. -- directory path: A path to a directory containing a binary file for each input, named the same as the input. Each binary file must contain the data required for that input for a batch-1 request. Each file should contain the raw binary representation of the input in row-major order. -- file path: A path to a JSON file containing data to be used with every inference request. See the "Real Input Data" section for further details. --input-data can be provided multiple times with different file paths to specific multiple JSON files. - -For tensors with with STRING/BYTES datatype there are additional -options --string-length and --string-data that may be used in some -cases (see --help for full documentation). - -For models that support batching you can use the -b option to indicate -the batch-size of the requests that perf_analyzer should send. For -models with variable-sized inputs you must provide the --shape -argument so that perf_analyzer knows what shape tensors to use. For -example, for a model that has an input called *IMAGE* that has shape [ -3, N, M ], where N and M are variable-size dimensions, to tell -perf_analyzer to send batch-size 4 requests of shape [ 3, 224, 224 ]: +# inside triton container +git clone --depth 1 https://github.com/triton-inference-server/server +mkdir model_repository ; cp -r server/docs/examples/model_repository/simple model_repository ``` -$ perf_analyzer -m mymodel -b 4 --shape IMAGE:3,224,224 -``` - -## Real Input Data -The performance of some models is highly dependent on the data used. 
-For such cases you can provide data to be used with every inference -request made by analyzer in a JSON file. The perf_analyzer will use -the provided data in a round-robin order when sending inference -requests. For sequence models, if a sequence length is specified via -`--sequence-length`, perf_analyzer will also loop through the provided data in a -round-robin order up to the specified sequence length (with a percentage -variation customizable via `--sequence-length-variation`). Otherwise, the -sequence length will be the number of inputs specified in user-provided input -data. - -Each entry in the "data" array must specify all input tensors with the -exact size expected by the model from a single batch. The following -example describes data for a model with inputs named, INPUT0 and -INPUT1, shape [4, 4] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - }, - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - ... - ] - } -``` +### Step 3: Start Triton Server -Note that the [4, 4] tensor has been flattened in a row-major format -for the inputs. In addition to specifying explicit tensors, you can -also provide Base64 encoded binary data for the tensors. Each data -object must list its data in a row-major order. Binary data must be in -little-endian byte order. The following example highlights how this -can be acheived: - -``` - { - "data" : - [ - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - { - "INPUT0" : {"b64": "YmFzZTY0IGRlY29kZXI="}, - "INPUT1" : {"b64": "YmFzZTY0IGRlY29kZXI="} - }, - ... - ] - } -``` +```bash +# inside triton container +tritonserver --model-repository $(pwd)/model_repository &> server.log & -In case of sequence models, multiple data streams can be specified in -the JSON file. Each sequence will get a data stream of its own and the -analyzer will ensure the data from each stream is played back to the -same correlation id. The below example highlights how to specify data -for multiple streams for a sequence model with a single input named -INPUT, shape [1] and data type STRING: +# confirm server is ready, look for 'HTTP/1.1 200 OK' +curl -v localhost:8000/v2/health/ready +# detatch (CTRL-p CTRL-q) ``` - { - "data" : - [ - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["2"] - }, - { - "INPUT" : ["3"] - }, - { - "INPUT" : ["4"] - } - ], - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - } - ], - [ - { - "INPUT" : ["1"] - }, - { - "INPUT" : ["1"] - } - ] - ] - } -``` - -The above example describes three data streams with lengths 4, 3 and 2 -respectively. The perf_analyzer will hence produce sequences of -length 4, 3 and 2 in this case. - -You can also provide an optional "shape" field to the tensors. This is -especially useful while profiling the models with variable-sized -tensors as input. 
Additionally note that when providing the "shape" field, -tensor contents must be provided separately in "content" field in row-major -order. The specified shape values will override default input shapes -provided as a command line option (see --shape) for variable-sized inputs. -In the absence of "shape" field, the provided defaults will be used. There -is no need to specify shape as a command line option if all the data steps -provide shape values for variable tensors. Below is an example json file -for a model with single input "INPUT", shape [-1,-1] and data type INT32: -``` - { - "data" : - [ - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [2,8] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [8,2] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - }, - { - "INPUT" : - { - "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "shape": [4,4] - } - } - ... - ] - } -``` +### Step 4: Start Triton SDK Container -The following is the example to provide contents as base64 string with explicit shapes: +```bash +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk ``` -{ - "data": [{ - "INPUT": { - "content": {"b64": "/9j/4AAQSkZ(...)"}, - "shape": [7964] - }}, - (...)] -} -``` - -Note that for STRING type an element is represented by a 4-byte unsigned integer giving -the length followed by the actual bytes. The byte array to be encoded using base64 must -include the 4-byte unsigned integers. - -### Output Validation -When real input data is provided, it is optional to request perf analyzer to -validate the inference output for the input data. +### Step 5: Run Perf Analyzer -Validation output can be specified in "validation_data" field in the same format -as "data" field for real input. Note that the entries in "validation_data" must -align with "data" for proper mapping. The following example describes validation -data for a model with inputs named, INPUT0 and INPUT1, outputs named, OUTPUT0 -and OUTPUT1, all tensors have shape [4, 4] and data type INT32: - -``` - { - "data" : - [ - { - "INPUT0" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], - "INPUT1" : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] - } - ... - ], - "validation_data" : - [ - { - "OUTPUT0" : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - "OUTPUT1" : [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] - } - ... - ] - } +```bash +# inside sdk container +perf_analyzer -m simple ``` -Besides the above example, the validation outputs can be specified in the same -variations described in "real input data" section. - -## Shared Memory +See the full [quick start guide](docs/quick_start.md) for additional tips on +how to analyze output. -By default perf_analyzer sends input tensor data and receives output -tensor data over the network. You can instead instruct perf_analyzer to -use system shared memory or CUDA shared memory to communicate tensor -data. By using these options you can model the performance that you -can achieve by using shared memory in your application. Use ---shared-memory=system to use system (CPU) shared memory or ---shared-memory=cuda to use CUDA shared memory. +
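Once the `simple` model is being served, the same SDK container can also run a small concurrency sweep and write the results to a CSV file; the flags are covered in the CLI documentation and the values here are only illustrative:

```bash
# inside sdk container
# sweep request concurrency from 1 to 4 and save per-concurrency results
perf_analyzer -m simple --concurrency-range 1:4 -f perf.csv
```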
-## Communication Protocol +# Documentation -By default perf_analyzer uses HTTP to communicate with Triton. The GRPC -protocol can be specificed with the -i option. If GRPC is selected the ---streaming option can also be specified for GRPC streaming. +- [Installation](docs/install.md) +- [Perf Analyzer CLI](docs/cli.md) +- [Inference Load Modes](docs/inference_load_modes.md) +- [Input Data](docs/input_data.md) +- [Measurements & Metrics](docs/measurements_metrics.md) +- [Benchmarking](docs/benchmarking.md) -### SSL/TLS Support +
-perf_analyzer can be used to benchmark Triton service behind SSL/TLS-enabled endpoints. These options can help in establishing secure connection with the endpoint and profile the server. +# Contributing -For gRPC, see the following options: +Contributions to Triton Perf Analyzer are more than welcome. To contribute +please review the [contribution +guidelines](https://github.com/triton-inference-server/server/blob/main/CONTRIBUTING.md), +then fork and create a pull request. -* `--ssl-grpc-use-ssl` -* `--ssl-grpc-root-certifications-file` -* `--ssl-grpc-private-key-file` -* `--ssl-grpc-certificate-chain-file` +
-More details here: https://grpc.github.io/grpc/cpp/structgrpc_1_1_ssl_credentials_options.html +# Reporting problems, asking questions -The -[inference protocol gRPC SSL/TLS section](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md#ssltls) -describes server-side options to configure SSL/TLS in Triton's gRPC endpoint. +We appreciate any feedback, questions or bug reporting regarding this +project. When help with code is needed, follow the process outlined in +the Stack Overflow (https://stackoverflow.com/help/mcve) +document. Ensure posted examples are: -For HTTPS, the following options are exposed: +- minimal - use as little code as possible that still produces the + same problem -* `--ssl-https-verify-peer` -* `--ssl-https-verify-host` -* `--ssl-https-ca-certificates-file` -* `--ssl-https-client-certificate-file` -* `--ssl-https-client-certificate-type` -* `--ssl-https-private-key-file` -* `--ssl-https-private-key-type` +- complete - provide all parts needed to reproduce the problem. Check + if you can strip external dependency and still show the problem. The + less time we spend on reproducing problems the more time we have to + fix it -See `--help` for full documentation. - -Unlike gRPC, Triton's HTTP server endpoint can not be configured with SSL/TLS support. - -Note: Just providing these `--ssl-http-*` options to perf_analyzer does not ensure the SSL/TLS is used in communication. If SSL/TLS is not enabled on the service endpoint, these options have no effect. The intent of exposing these options to a user of perf_analyzer is to allow them to configure perf_analyzer to benchmark Triton service behind SSL/TLS-enabled endpoints. In other words, if Triton is running behind a HTTPS server proxy, then these options would allow perf_analyzer to profile Triton via exposed HTTPS proxy. - -## Benchmarking Triton directly via C API - -Besides using HTTP or gRPC server endpoints to communicate with Triton, perf_analyzer also allows user to benchmark Triton directly using C API. HTTP/gRPC endpoints introduce an additional latency in the pipeline which may not be of interest to the user who is using Triton via C API within their application. Specifically, this feature is useful to benchmark bare minimum Triton without additional overheads from HTTP/gRPC communication. - -### Prerequisite -Pull the Triton SDK and the Inference Server container images on target machine. -Since you will need access to the Tritonserver install, it might be easier if -you copy the perf_analyzer binary to the Inference Server container. - -### Required Parameters -Use the --help option to see complete list of supported command line arguments. -By default perf_analyzer expects the Triton instance to already be running. You can configure the C API mode using the `--service-kind` option. In additon, you will need to point -perf_analyzer to the Triton server library path using the `--triton-server-directory` option and the model -repository path using the `--model-repository` option. -If the server is run successfully, there is a prompt: "server is alive!" and perf_analyzer will print the stats, as normal. -An example run would look like: -``` -perf_analyzer -m graphdef_int32_int32_int32 --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/workspace/qa/L0_perf_analyzer_capi/models -``` - -### Non-supported functionalities -There are a few functionalities that are missing from the C API. They are: -1. Async mode (`-a`) -2. 
Using shared memory mode (`--shared-memory=cuda` or `--shared-memory=system`) -3. Request rate range mode -4. For additonal known non-working cases, please refer to - [qa/L0_perf_analyzer_capi/test.sh](https://github.com/triton-inference-server/server/blob/main/qa/L0_perf_analyzer_capi/test.sh#L239-L277) - - -## Benchmarking TensorFlow Serving -perf_analyzer can also be used to benchmark models deployed on -[TensorFlow Serving](https://github.com/tensorflow/serving) using -the `--service-kind` option. The support is however only available -through gRPC protocol. - -Following invocation demonstrates how to configure perf_analyzer -to issue requests to a running instance of -`tensorflow_model_server`: - -``` -$ perf_analyzer -m resnet50 --service-kind tfserving -i grpc -b 1 -p 5000 -u localhost:8500 -*** Measurement Settings *** - Batch size: 1 - Using "time_windows" mode for stabilization - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using average latency -Request concurrency: 1 - Client: - Request count: 829 - Throughput: 165.8 infer/sec - Avg latency: 6032 usec (standard deviation 569 usec) - p50 latency: 5863 usec - p90 latency: 6655 usec - p95 latency: 6974 usec - p99 latency: 8093 usec - Avg gRPC time: 5984 usec ((un)marshal request/response 257 usec + response wait 5727 usec) -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 165.8 infer/sec, latency 6032 usec -``` - -You might have to specify a different url(`-u`) to access wherever -the server is running. The report of perf_analyzer will only -include statistics measured at the client-side. - -**NOTE:** The support is still in **beta**. perf_analyzer does -not guarantee optimum tuning for TensorFlow Serving. However, a -single benchmarking tool that can be used to stress the inference -servers in an identical manner is important for performance -analysis. - - -The following points are important for interpreting the results: -1. `Concurrent Request Execution`: -TensorFlow Serving (TFS), as of version 2.8.0, by default creates -threads for each request that individually submits requests to -TensorFlow Session. There is a resource limit on the number of -concurrent threads serving requests. When benchmarking at a higher -request concurrency, you can see higher throughput because of this. -Unlike TFS, by default Triton is configured with only a single -[instance count](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups) -. Hence, at a higher request concurrency, most -of the requests are blocked on the instance availability. To -configure Triton to behave like TFS, set the instance count to a -reasonably high value and then set -[MAX_SESSION_SHARE_COUNT](https://github.com/triton-inference-server/tensorflow_backend#parameters) -parameter in the model confib.pbtxt to the same value.For some -context, the TFS sets its thread constraint to four times the -num of schedulable CPUs. -2. `Different library versions`: -The version of TensorFlow might differ between Triton and -TensorFlow Serving being benchmarked. Even the versions of cuda -libraries might differ between the two solutions. The performance -of models can be susceptible to the versions of these libraries. 
-For a single request concurrency, if the compute_infer time -reported by perf_analyzer when benchmarking Triton is as large as -the latency reported by perf_analyzer when benchmarking TFS, then -the performance difference is likely because of the difference in -the software stack and outside the scope of Triton. -3. `CPU Optimization`: -TFS has separate builds for CPU and GPU targets. They have -target-specific optimization. Unlike TFS, Triton has a single build -which is optimized for execution on GPUs. When collecting performance -on CPU models on Triton, try running Triton with the environment -variable `TF_ENABLE_ONEDNN_OPTS=1`. - - -## Benchmarking TorchServe -perf_analyzer can also be used to benchmark -[TorchServe](https://github.com/pytorch/serve) using the -`--service-kind` option. The support is however only available through -HTTP protocol. It also requires input to be provided via JSON file. - -Following invocation demonstrates how to configure perf_analyzer to -issue requests to a running instance of `torchserve` assuming the -location holds `kitten_small.jpg`: - -``` -$ perf_analyzer -m resnet50 --service-kind torchserve -i http -u localhost:8080 -b 1 -p 5000 --input-data data.json - Successfully read data for 1 stream/streams with 1 step/steps. -*** Measurement Settings *** - Batch size: 1 - Using "time_windows" mode for stabilization - Measurement window: 5000 msec - Using synchronous calls for inference - Stabilizing using average latency -Request concurrency: 1 - Client: - Request count: 799 - Throughput: 159.8 infer/sec - Avg latency: 6259 usec (standard deviation 397 usec) - p50 latency: 6305 usec - p90 latency: 6448 usec - p95 latency: 6494 usec - p99 latency: 7158 usec - Avg HTTP time: 6272 usec (send/recv 77 usec + response wait 6195 usec) -Inferences/Second vs. Client Average Batch Latency -Concurrency: 1, throughput: 159.8 infer/sec, latency 6259 usec -``` - -The content of `data.json`: - -``` - { - "data" : - [ - { - "TORCHSERVE_INPUT" : ["kitten_small.jpg"] - } - ] - } -``` - -You might have to specify a different url(`-u`) to access wherever -the server is running. The report of perf_analyzer will only include -statistics measured at the client-side. - -**NOTE:** The support is still in **beta**. perf_analyzer does not -guarantee optimum tuning for TorchServe. However, a single benchmarking -tool that can be used to stress the inference servers in an identical -manner is important for performance analysis. - -## Advantages of using Perf Analyzer over third-party benchmark suites - -Triton Inference Server offers the entire serving solution which -includes [client libraries](https://github.com/triton-inference-server/client) -that are optimized for Triton. -Using third-party benchmark suites like jmeter fails to take advantage of the -optimized libraries. Some of these optimizations includes but are not limited -to: -1. Using -[binary tensor data extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) -with HTTP requests. -2. Effective re-use of gRPC message allocation in subsequent requests. -3. Avoiding extra memory copy via libcurl interface. - -These optimizations can have a tremendous impact on overall performance. -Using perf_analyzer for benchmarking directly allows a user to access -these optimizations in their study. - -Not only that, perf_analyzer is also very customizable and supports many -Triton features as described in this document. 
This, along with a detailed -report, allows a user to identify performance bottlenecks and experiment -with different features before deciding upon what works best for them. +- verifiable - test the code you're about to provide to make sure it + reproduces the problem. Remove all other problems that are not + related to your request/question. diff --git a/src/c++/perf_analyzer/command_line_parser.cc b/src/c++/perf_analyzer/command_line_parser.cc index 5ca4fc9f6..4a226c1e1 100644 --- a/src/c++/perf_analyzer/command_line_parser.cc +++ b/src/c++/perf_analyzer/command_line_parser.cc @@ -168,8 +168,8 @@ CLParser::Usage(const std::string& msg) "{\"data\" : [{\"TORCHSERVE_INPUT\" : [\"\"]}, {...}...]}. The type of file here will depend " "on the model. In order to use \"triton_c_api\" you must specify " - "the Triton server install path and the model repository " - "path via the --library-name and --model-repo flags", + "the Triton server install path and the model repository path via " + "the --triton-server-directory and --model-repository flags", 18) << std::endl; diff --git a/src/c++/perf_analyzer/docs/README.md b/src/c++/perf_analyzer/docs/README.md new file mode 100644 index 000000000..485c4207f --- /dev/null +++ b/src/c++/perf_analyzer/docs/README.md @@ -0,0 +1,54 @@ + + +# **Perf Analyzer Documentation** + +| [Installation](README.md#installation) | [Getting Started](README.md#getting-started) | [User Guide](README.md#user-guide) | +| -------------------------------------- | -------------------------------------------- | ---------------------------------- | + +## **Installation** + +See the [Installation Guide](install.md) for details on how to install Perf +Analyzer. + +## **Getting Started** + +The [Quick Start Guide](quick_start.md) will show you how to use Perf +Analyzer to profile a simple PyTorch model. + +## **User Guide** + +The User Guide describes the Perf Analyzer command line options, how to specify +model input data, the performance measurement modes, the performance metrics and +outputs, how to benchmark different servers, and more. + +- [Perf Analyzer CLI](cli.md) +- [Inference Load Modes](inference_load_modes.md) +- [Input Data](input_data.md) +- [Measurements & Metrics](measurements_metrics.md) +- [Benchmarking](benchmarking.md) diff --git a/src/c++/perf_analyzer/docs/benchmarking.md b/src/c++/perf_analyzer/docs/benchmarking.md new file mode 100644 index 000000000..d900ff852 --- /dev/null +++ b/src/c++/perf_analyzer/docs/benchmarking.md @@ -0,0 +1,250 @@ + + +# Benchmarking Triton via HTTP or gRPC endpoint + +This is the default mode for Perf Analyzer. + +# Benchmarking Triton directly via C API + +Besides using HTTP or gRPC server endpoints to communicate with Triton, Perf +Analyzer also allows users to benchmark Triton directly using the C API. HTTP +and gRPC endpoints introduce an additional latency in the pipeline which may not +be of interest to users who are using Triton via C API within their application. +Specifically, this feature is useful to benchmark a bare minimum Triton without +additional overheads from HTTP/gRPC communication. + +## Prerequisite + +Pull the Triton SDK and the Triton Server container images on target machine. +Since you will need access to the `tritonserver` install, it might be easier if +you copy the `perf_analyzer` binary to the Inference Server container. + +## Required parameters + +Use the `--help` option to see a complete list of supported command line +arguments. 
By default, Perf Analyzer expects the Triton instance to already be
+running. You can configure C API mode using the
+[`--service-kind`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
+option. In addition, you will need to point Perf Analyzer to the Triton server
+library path using the
+[`--triton-server-directory`](cli.md#--triton-server-directorypath) option and
+the model repository path using the
+[`--model-repository`](cli.md#--model-repositorypath) option.
+
+An example run would look like:
+
+```
+$ perf_analyzer -m my_model --service-kind=triton_c_api --triton-server-directory=/opt/tritonserver --model-repository=/my/model/repository
+...
+*** Measurement Settings ***
+  Service Kind: Triton C-API
+  Using "time_windows" mode for stabilization
+  Measurement window: 5000 msec
+  Using synchronous calls for inference
+  Stabilizing using average latency
+
+Request concurrency: 1
+  Client:
+    Request count: 353
+    Throughput: 19.6095 infer/sec
+    Avg latency: 50951 usec (standard deviation 2265 usec)
+    p50 latency: 50833 usec
+    p90 latency: 50923 usec
+    p95 latency: 50940 usec
+    p99 latency: 50985 usec
+
+  Server:
+    Inference count: 353
+    Execution count: 353
+    Successful request count: 353
+    Avg request latency: 50841 usec (overhead 20 usec + queue 63 usec + compute input 35 usec + compute infer 50663 usec + compute output 59 usec)
+
+Inferences/Second vs. Client Average Batch Latency
+Concurrency: 1, throughput: 19.6095 infer/sec, latency 50951 usec
+```
+
+## Non-supported functionalities
+
+There are a few functionalities that are missing from C API mode. They are:
+
+1. Async mode ([`--async`](cli.md#--async))
+2. For additional known non-working cases, please refer to
+   [qa/L0_perf_analyzer_capi/test.sh](https://github.com/triton-inference-server/server/blob/main/qa/L0_perf_analyzer_capi/test.sh#L239-L277)
+
+# Benchmarking TensorFlow Serving
+
+Perf Analyzer can also be used to benchmark models deployed on
+[TensorFlow Serving](https://github.com/tensorflow/serving) using the
+[`--service-kind=tfserving`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve)
+option. Only gRPC protocol is supported.
+
+The following invocation demonstrates how to configure Perf Analyzer to issue
+requests to a running instance of `tensorflow_model_server`:
+
+```
+$ perf_analyzer -m resnet50 --service-kind tfserving -i grpc -b 1 -p 5000 -u localhost:8500
+*** Measurement Settings ***
+  Batch size: 1
+  Using "time_windows" mode for stabilization
+  Measurement window: 5000 msec
+  Using synchronous calls for inference
+  Stabilizing using average latency
+Request concurrency: 1
+  Client:
+    Request count: 829
+    Throughput: 165.8 infer/sec
+    Avg latency: 6032 usec (standard deviation 569 usec)
+    p50 latency: 5863 usec
+    p90 latency: 6655 usec
+    p95 latency: 6974 usec
+    p99 latency: 8093 usec
+    Avg gRPC time: 5984 usec ((un)marshal request/response 257 usec + response wait 5727 usec)
+Inferences/Second vs. Client Average Batch Latency
+Concurrency: 1, throughput: 165.8 infer/sec, latency 6032 usec
+```
+
+You might have to specify a different URL ([`-u`](cli.md#-u-url)) to access
+wherever the server is running. The report of Perf Analyzer will only include
+statistics measured at the client-side.
+
+**NOTE:** The support is still in **beta**. Perf Analyzer does not guarantee
+optimal tuning for TensorFlow Serving. However, a single benchmarking tool that
+can be used to stress the inference servers in an identical manner is important
+for performance analysis.
+ +The following points are important for interpreting the results: + +1. `Concurrent Request Execution`: + TensorFlow Serving (TFS), as of version 2.8.0, by default creates threads for + each request that individually submits requests to TensorFlow Session. There + is a resource limit on the number of concurrent threads serving requests. + When benchmarking at a higher request concurrency, you can see higher + throughput because of this. Unlike TFS, by default Triton is configured with + only a single + [instance count](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups). + Hence, at a higher request concurrency, most of the requests are blocked on + the instance availability. To configure Triton to behave like TFS, set the + instance count to a reasonably high value and then set + [MAX_SESSION_SHARE_COUNT](https://github.com/triton-inference-server/tensorflow_backend#parameters) + parameter in the model `config.pbtxt` to the same value. For some context, + the TFS sets its thread constraint to four times the num of schedulable CPUs. +2. `Different library versions`: + The version of TensorFlow might differ between Triton and TensorFlow Serving + being benchmarked. Even the versions of CUDA libraries might differ between + the two solutions. The performance of models can be susceptible to the + versions of these libraries. For a single request concurrency, if the + `compute_infer` time reported by Perf Analyzer when benchmarking Triton is as + large as the latency reported by Perf Analyzer when benchmarking TFS, then + the performance difference is likely because of the difference in the + software stack and outside the scope of Triton. +3. `CPU Optimization`: + TFS has separate builds for CPU and GPU targets. They have target-specific + optimization. Unlike TFS, Triton has a single build which is optimized for + execution on GPUs. When collecting performance on CPU models on Triton, try + running Triton with the environment variable `TF_ENABLE_ONEDNN_OPTS=1`. + +# Benchmarking TorchServe + +Perf Analyzer can also be used to benchmark +[TorchServe](https://github.com/pytorch/serve) using the +[`--service-kind=torchserve`](cli.md#--service-kindtritontriton_c_apitfservingtorchserve) +option. Only HTTP protocol is supported. It also requires input to be provided +via JSON file. + +The following invocation demonstrates how to configure Perf Analyzer to issue +requests to a running instance of `torchserve` assuming the location holds +`kitten_small.jpg`: + +``` +$ perf_analyzer -m resnet50 --service-kind torchserve -i http -u localhost:8080 -b 1 -p 5000 --input-data data.json + Successfully read data for 1 stream/streams with 1 step/steps. +*** Measurement Settings *** + Batch size: 1 + Using "time_windows" mode for stabilization + Measurement window: 5000 msec + Using synchronous calls for inference + Stabilizing using average latency +Request concurrency: 1 + Client: + Request count: 799 + Throughput: 159.8 infer/sec + Avg latency: 6259 usec (standard deviation 397 usec) + p50 latency: 6305 usec + p90 latency: 6448 usec + p95 latency: 6494 usec + p99 latency: 7158 usec + Avg HTTP time: 6272 usec (send/recv 77 usec + response wait 6195 usec) +Inferences/Second vs. 
Client Average Batch Latency +Concurrency: 1, throughput: 159.8 infer/sec, latency 6259 usec +``` + +The content of `data.json`: + +```json + { + "data" : + [ + { + "TORCHSERVE_INPUT" : ["kitten_small.jpg"] + } + ] + } +``` + +You might have to specify a different url ([`-u`](cli.md#-u-url)) to access +wherever the server is running. The report of Perf Analyzer will only include +statistics measured at the client-side. + +**NOTE:** The support is still in **beta**. Perf Analyzer does not guarantee +optimal tuning for TorchServe. However, a single benchmarking tool that can be +used to stress the inference servers in an identical manner is important for +performance analysis. + +# Advantages of using Perf Analyzer over third-party benchmark suites + +Triton Inference Server offers the entire serving solution which includes +[client libraries](https://github.com/triton-inference-server/client) that are +optimized for Triton. Using third-party benchmark suites like `jmeter` fails to +take advantage of the optimized libraries. Some of these optimizations includes +but are not limited to: + +1. Using + [binary tensor data extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md#binary-tensor-data-extension) + with HTTP requests. +2. Effective re-use of gRPC message allocation in subsequent requests. +3. Avoiding extra memory copy via libcurl interface. + +These optimizations can have a tremendous impact on overall performance. Using +Perf Analyzer for benchmarking directly allows a user to access these +optimizations in their study. + +Not only that, Perf Analyzer is also very customizable and supports many Triton +features as described in this document. This, along with a detailed report, +allows a user to identify performance bottlenecks and experiment with different +features before deciding upon what works best for them. diff --git a/src/c++/perf_analyzer/docs/cli.md b/src/c++/perf_analyzer/docs/cli.md new file mode 100644 index 000000000..19844514e --- /dev/null +++ b/src/c++/perf_analyzer/docs/cli.md @@ -0,0 +1,601 @@ + + +# Perf Analyzer CLI + +This document details the Perf Analyzer command line interface: + +- [General Options](#general-options) +- [Measurement Options](#measurement-options) +- [Sequence Model Options](#sequence-model-options) +- [Input Data Options](#input-data-options) +- [Request Options](#request-options) +- [Server Options](#server-options) +- [Prometheus Metrics Options](#prometheus-metrics-options) +- [Report Options](#report-options) +- [Trace Options](#trace-options) +- [Deprecated Options](#deprecated-options) + +## General Options + +#### `-?` +#### `-h` +#### `--help` + +Prints a description of the Perf Analyzer command line interface. + +#### `-m ` + +Specifies the model name for Perf Analyzer to run. + +This is a required option. + +#### `-x ` + +Specifies the version of the model to be used. If not specified the most +recent version (the highest numbered version) of the model will be used. + +#### `--service-kind=[triton|triton_c_api|tfserving|torchserve]` + +Specifies the kind of service for Perf Analyzer to generate load for. Note: in +order to use `torchserve` backend, the `--input-data` option must point to a +JSON file holding data in the following format: + +``` +{ + "data": [ + { + "TORCHSERVE_INPUT": [ + "" + ] + }, + {...}, + ... + ] +} +``` + +The type of file here will depend on the model. 
In order to use `triton_c_api` +you must specify the Triton server install path and the model repository path +via the `--triton-server-directory` and `--model-repository` options. + +Default is `triton`. + +#### `--bls-composing-models=` + +Specifies the list of all BLS composing models as a comma separated list of +model names (with optional model version number after a colon for each) that may +be called by the input BLS model. For example, +`--bls-composing-models=modelA:3,modelB` would specify that modelA and modelB +are composing models that may be called by the input BLS model, and that modelA +will use version 3, while modelB's version is unspecified. + +#### `--model-signature-name=` + +Specifies the signature name of the saved model to use. + +Default is `serving_default`. This option will be ignored if `--service-kind` +is not `tfserving`. + +#### `-v` + +Enables verbose mode. May be specified an additional time (`-v -v`) to enable +extra verbose mode. + +## Measurement Options + +#### `--measurement-mode=[time_windows|count_windows]` + +Specifies the mode used for stabilizing measurements. 'time_windows' will +create windows such that the duration of each window is equal to +`--measurement-interval`. 'count_windows' will create windows such that there +are at least `--measurement-request-count` requests in each window and that +the window is at least one second in duration (adding more requests if +necessary). + +Default is `time_windows`. + +#### `-p ` +#### `--measurement-interval=` + +Specifies the time interval used for each measurement in milliseconds when +`--measurement-mode=time_windows` is used. Perf Analyzer will sample a time +interval specified by this option and take measurement over the requests +completed within that time interval. + +Default is `5000`. + +#### `--measurement-request-count=` + +Specifies the minimum number of requests to be collected in each measurement +window when `--measurement-mode=count_windows` is used. + +Default is `50`. + +#### `-s ` +#### `--stability-percentage=` + +Specifies the allowed variation in latency measurements when determining if a +result is stable. The measurement is considered stable if the ratio of max / +min from the recent 3 measurements is within (stability percentage)% in terms +of both inferences per second and latency. + +Default is `10`(%). + +#### `--percentile=` + +Specifies the confidence value as a percentile that will be used to determine +if a measurement is stable. For example, a value of `85` indicates that the +85th percentile latency will be used to determine stability. The percentile +will also be reported in the results. + +Default is `-1` indicating that the average latency is used to determine +stability. + +#### `-r ` +#### `--max-trials=` + +Specifies the maximum number of measurements when attempting to reach stability +of inferences per second and latency for each concurrency or request rate +during the search. Perf Analyzer will terminate if the measurement is still +unstable after the maximum number of trials. + +Default is `10`. + +#### `--concurrency-range=` + +Specifies the range of concurrency levels covered by Perf Analyzer. Perf +Analyzer will start from the concurrency level of 'start' and go until 'end' +with a stride of 'step'. + +Default of 'end' and 'step' are `1`. If 'end' is not specified then Perf +Analyzer will run for a single concurrency level determined by 'start'. 
If
+'end' is set as `0`, then the concurrency limit will be incremented by 'step'
+until the latency threshold is met. 'end' and `--latency-threshold` cannot
+both be `0`. 'end' cannot be `0` for sequence models while using asynchronous
+mode.
+
+#### `--request-rate-range=`
+
+Specifies the range of request rates for load generated by Perf Analyzer. This
+option can take floating-point values. The search along the request rate range
+is enabled only when using this option.
+
+If not specified, then Perf Analyzer will search along the concurrency range.
+Perf Analyzer will start from the request rate of 'start' and go until 'end'
+with a stride of 'step'. Default values of 'start', 'end' and 'step' are all
+`1.0`. If 'end' is not specified, then Perf Analyzer will run for a single
+request rate as determined by 'start'. If 'end' is set as `0.0`, then the
+request rate will be incremented by 'step' until the latency threshold is met.
+'end' and `--latency-threshold` cannot both be `0`.
+
+#### `--request-distribution=[constant|poisson]`
+
+Specifies the time interval distribution between dispatching inference requests
+to the server. Poisson distribution closely mimics the real-world work load on
+a server. This option is ignored if not using `--request-rate-range`.
+
+Default is `constant`.
+
+#### `-l `
+#### `--latency-threshold=`
+
+Specifies the limit on the observed latency, in milliseconds. Perf Analyzer
+will terminate the concurrency or request rate search once the measured latency
+exceeds this threshold.
+
+Default is `0` indicating that Perf Analyzer will run for the entire
+concurrency or request rate range.
+
+#### `--binary-search`
+
+Enables binary search on the specified search range (concurrency or request
+rate). This option requires 'start' and 'end' to be explicitly specified in
+the concurrency range or request rate range. When using this option, 'step' is
+more like the precision. When the 'step' is lower, there are more iterations
+along the search path to find suitable convergence.
+
+When `--binary-search` is not specified, linear search is used.
+
+#### `--request-intervals=`
+
+Specifies a path to a file containing time intervals in microseconds. Each time
+interval should be in a new line. Perf Analyzer will try to maintain time
+intervals between successive generated requests to be as close as possible in
+this file. This option can be used to apply a custom load to the server with a
+certain pattern of interest. Perf Analyzer will loop around the file if the
+duration of execution exceeds the amount of time specified by the intervals.
+This option cannot be used with `--request-rate-range` or
+`--concurrency-range`.
+
+#### `--max-threads=`
+
+Specifies the maximum number of threads that will be created for providing
+desired concurrency or request rate. However, when running in synchronous mode
+with `--concurrency-range` having explicit 'end' specification, this value will
+be ignored.
+
+Default is `4` if `--request-rate-range` is specified, otherwise default is
+`16`.
+
+## Sequence Model Options
+
+#### `--num-of-sequences=`
+
+Specifies the number of concurrent sequences for sequence models. This option
+is ignored when `--request-rate-range` is not specified.
+
+Default is `4`.
+
+#### `--sequence-length=`
+
+Specifies the base length of a sequence used for sequence models. A sequence
+with length X will be composed of X requests to be sent as the elements in the
+sequence.
The actual length of the sequence will be within +/- Y% of the base
+length, where Y defaults to 20% and is customizable via
+`--sequence-length-variation`. If sequence length is unspecified and input data
+is provided, the sequence length will be the number of inputs in the
+user-provided input data.
+
+Default is `20`.
+
+#### `--sequence-length-variation=`
+
+Specifies the percentage variation in length of sequences. This option is only
+valid when not using user-provided input data or when `--sequence-length` is
+specified while using user-provided input data.
+
+Default is `20`(%).
+
+#### `--sequence-id-range=`
+
+Specifies the range of sequence IDs used by Perf Analyzer. Perf Analyzer will
+start from the sequence ID of 'start' and go until 'end' (excluded). If 'end'
+is not specified then Perf Analyzer will generate new sequence IDs without
+bounds. If 'end' is specified and the concurrency setting may result in
+maintaining a number of sequences more than the range of available sequence
+IDs, Perf Analyzer will exit with an error due to possible sequence ID
+collisions.
+
+The default for 'start' is `1`, and 'end' is not specified (no bounds).
+
+## Input Data Options
+
+#### `--input-data=[zero|random|]`
+
+Specifies the type of data that will be used for input in inference requests.
+The available options are `zero`, `random`, and a path to a directory or a JSON
+file.
+
+When pointing to a JSON file, the user must adhere to the format described in
+the [input data documentation](input_data.md). By specifying JSON data, users
+can control data used with every request. Multiple data streams can be specified
+for a sequence model, and Perf Analyzer will select a data stream in a
+round-robin fashion for every new sequence. Multiple JSON files can also be
+provided (`--input-data json_file1.json --input-data json_file2.json` and so on)
+and Perf Analyzer will append data streams from each file. When using
+`--service-kind=torchserve`, make sure this option points to a JSON file.
+
+If the option is a path to a directory then the directory must contain a binary
+file for each non-string input and a text file for each string input, named the
+same as the input. Each file must contain the data required for that input for
+a batch-1 request. Each binary file should contain the raw binary representation
+of the input in row-major order for non-string inputs. The text file should
+contain all strings needed by batch-1, each in a new line, listed in row-major
+order.
+
+Default is `random`.
+
+#### `-b `
+
+Specifies the batch size for each request sent.
+
+Default is `1`.
+
+#### `--shape=`
+
+Specifies the shape used for the specified input. The argument must be
+specified as 'name:shape' where the shape is a comma-separated list for
+dimension sizes. For example `--shape=input_name:1,2,3` indicates that the
+input `input_name` has tensor shape [ 1, 2, 3 ]. `--shape` may be specified
+multiple times to specify shapes for different inputs.
+
+#### `--string-data=`
+
+Specifies the string to initialize string input buffers. Perf Analyzer will
+replicate the given string to build tensors of required shape.
+`--string-length` will not have any effect. This option is ignored if
+`--input-data` points to a JSON file or directory.
+
+#### `--string-length=`
+
+Specifies the length of the random strings to be generated by Perf Analyzer
+for string input. This option is ignored if `--input-data` points to a
+JSON file or directory.
+
+Default is `128`.
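To illustrate how these input data options combine, the sketch below (the model and tensor names are hypothetical) sends batch-4 requests while fixing the shape of a variable-sized input:

```bash
# batch size 4, explicit shape for the variable-sized input tensor "IMAGE"
perf_analyzer -m my_model -b 4 --shape IMAGE:3,224,224
```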
+ +#### `--shared-memory=[none|system|cuda]` + +Specifies the type of the shared memory to use for input and output data. + +Default is `none`. + +#### `--output-shared-memory-size=` + +Specifies The size, in bytes, of the shared memory region to allocate per +output tensor. Only needed when one or more of the outputs are of string type +and/or variable shape. The value should be larger than the size of the largest +output tensor that the model is expected to return. Perf Analyzer will use the +following formula to calculate the total shared memory to allocate: +output_shared_memory_size * number_of_outputs * batch_size. + +Default is `102400` (100 KB). + +## Request Options + +#### `-i [http|grpc]` + +Specifies the communication protocol to use. The available protocols are gRPC +and HTTP. + +Default is `http`. + +#### `-a` +#### `--async` + +Enables asynchronous mode in Perf Analyzer. + +By default, Perf Analyzer will use a synchronous request API for inference. +However, if the model is sequential, then the default mode is asynchronous. +Specify `--sync` to operate sequential models in synchronous mode. In +synchronous mode, Perf Analyzer will start threads equal to the concurrency +level. Use asynchronous mode to limit the number of threads, yet maintain the +concurrency. + +#### `--sync` + +Enables synchronous mode in Perf Analyzer. Can be used to operate Perf +Analyzer with sequential model in synchronous mode. + +#### `--streaming` + +Enables the use of streaming API. This option is only valid with gRPC protocol. + +Default is `false`. + +#### `-H ` + +Specifies the header that will be added to HTTP requests (ignored for gRPC +requests). The header must be specified as 'Header:Value'. `-H` may be +specified multiple times to add multiple headers. + +#### `--grpc-compression-algorithm=[none|gzip|deflate]` + +Specifies the compression algorithm to be used by gRPC when sending requests. +Only supported when gRPC protocol is being used. + +Default is `none`. + +## Server Options + +#### `-u ` + +Specifies the URL for the server. + +Default is `localhost:8000` when using `--service-kind=triton` with HTTP. +Default is `localhost:8001` when using `--service-kind=triton` with gRPC. +Default is `localhost:8500` when using `--service-kind=tfserving`. + +#### `--ssl-grpc-use-ssl` + +Enables usage of an encrypted channel to the server. + +#### `--ssl-grpc-root-certifications-file=` + +Specifies the path to file containing the PEM encoding of the server root +certificates. + +#### `--ssl-grpc-private-key-file=` + +Specifies the path to file containing the PEM encoding of the client's private +key. + +#### `--ssl-grpc-certificate-chain-file=` + +Specifies the path to file containing the PEM encoding of the client's +certificate chain. + +#### `--ssl-https-verify-peer=[0|1]` + +Specifies whether to verify the peer's SSL certificate. See +https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html for the meaning of each +value. + +Default is `1`. + +#### `--ssl-https-verify-host=[0|1|2]` + +Specifies whether to verify the certificate's name against host. See +https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html for the meaning of each +value. + +Default is `2`. + +#### `--ssl-https-ca-certificates-file=` + +Specifies the path to Certificate Authority (CA) bundle. + +#### `--ssl-https-client-certificate-file=` + +Specifies the path to the SSL client certificate. + +#### `--ssl-https-client-certificate-type=[PEM|DER]` + +Specifies the type of the client SSL certificate. + +Default is `PEM`. 
+ +#### `--ssl-https-private-key-file=` + +Specifies the path to the private keyfile for TLS and SSL client cert. + +#### `--ssl-https-private-key-type=[PEM|DER]` + +Specifies the type of the private key file. + +Default is `PEM`. + +#### `--triton-server-directory=` + +Specifies the Triton server install path. Required by and only used when C API +is used (`--service-kind=triton_c_api`). + +Default is `/opt/tritonserver`. + +#### `--model-repository=` + +Specifies the model repository directory path for loading models. Required by +and only used when C API is used (`--service-kind=triton_c_api`). + +## Prometheus Metrics Options + +#### `--collect-metrics` + +Enables the collection of server-side inference server metrics. Perf Analyzer +will output metrics in the CSV file generated with the `-f` option. Only valid +when `--verbose-csv` option also used. + +#### `--metrics-url=` + +Specifies the URL to query for server-side inference server metrics. + +Default is `localhost:8002/metrics`. + +#### `--metrics-interval=` + +Specifies how often within each measurement window, in milliseconds, Perf +Analyzer should query for server-side inference server metrics. + +Default is `1000`. + +## Report Options + +#### `-f ` + +Specifies the path that the latency report file will be generated at. + +When `-f` is not specified, a latency report will not be generated. + +#### `--verbose-csv` + +Enables additional information being output to the CSV file generated by Perf +Analyzer. + +## Trace Options + +#### `--trace-file=` + +Specifies the file where trace output will be saved. + +If `--trace-log-frequency` is also specified, this argument value will be the +prefix of the files to save the trace output. See `--trace-log-frequency` for +details. Only used for `--service-kind=triton`. + +#### `--trace-level=[OFF|TIMESTAMPS|TENSORS]` + +Specifies a trace level. `OFF` disables tracing. `TIMESTAMPS` traces +timestamps. `TENSORS` traces tensors. It may be specified multiple times to +trace multiple informations. + +Default is `OFF`. + +#### `--trace-rate=` + +Specifies the trace sampling rate (traces per second). + +Default is `1000`. + +#### `--trace-count=` + +Specifies the number of traces to be sampled. If the value is `-1`, the number +of traces to be sampled will not be limited. + +Default is `-1`. + +#### `--log-frequency=` + +Specifies the trace log frequency. If the value is `0`, Triton will only log +the trace output to path specified via `--trace-file` when shutting down. +Otherwise, Triton will log the trace output to the path specified via +`--trace-file`. when it collects the specified number of traces. For +example, if `--trace-file` is specified to be `trace_file.log`, and if the log +frequency is `100`, when Triton collects the 100th trace, it logs the traces +to file `trace_file.log.0`, and when it collects the 200th trace, it logs the +101st to the 200th traces to file `trace_file.log.1`. + +Default is `0`. + +## Deprecated Options + +#### `--data-directory=` + +**DEPRECATED** + +Alias for `--input-data=` where `` is the path to a directory. See +`--input-data` option documentation for details. + +#### `-c ` + +**DEPRECATED** + +Specifies the maximum concurrency that Perf Analyzer will search up to. Cannot +be used with `--concurrency-range`. + +#### `-d` + +**DEPRECATED** + +Enables dynamic concurrency mode. Perf Analyzer will search along +concurrencies up to the maximum concurrency specified via `-c `. Cannot be +used with `--concurrency-range`. 
#### `-t <n>`

**DEPRECATED**

Specifies the number of concurrent requests. Cannot be used with `--concurrency-range`.

Default is `1`.

#### `-z`

**DEPRECATED**

Alias for `--input-data=zero`. See `--input-data` option documentation for details.

diff --git a/src/c++/perf_analyzer/docs/inference_load_modes.md b/src/c++/perf_analyzer/docs/inference_load_modes.md
new file mode 100644
index 000000000..8b119cea6
--- /dev/null
+++ b/src/c++/perf_analyzer/docs/inference_load_modes.md
@@ -0,0 +1,66 @@

# Inference Load Modes

Perf Analyzer has several modes for generating inference request load for a model.

## Concurrency Mode

In concurrency mode, Perf Analyzer attempts to send inference requests to the server such that N requests are always outstanding during profiling. For example, when using [`--concurrency-range=4`](cli.md#--concurrency-rangestartendstep), Perf Analyzer will attempt to have 4 outgoing inference requests at all times during profiling.

## Request Rate Mode

In request rate mode, Perf Analyzer attempts to send N inference requests per second to the server during profiling. For example, when using [`--request-rate-range=20`](cli.md#--request-rate-rangestartendstep), Perf Analyzer will attempt to send 20 requests per second during profiling.

## Custom Interval Mode

In custom interval mode, Perf Analyzer attempts to send inference requests according to intervals (between requests, looping if necessary) provided by the user in the form of a text file with one time interval (in microseconds) per line. For example, when using [`--request-intervals=my_intervals.txt`](cli.md#--request-intervalspath), where `my_intervals.txt` contains:

```
100000
200000
500000
```

Perf Analyzer will attempt to send requests at the following times: 0.1s, 0.3s, 0.8s, 0.9s, 1.1s, 1.6s, and so on, during profiling.

diff --git a/src/c++/perf_analyzer/docs/input_data.md b/src/c++/perf_analyzer/docs/input_data.md
new file mode 100644
index 000000000..83a305c10
--- /dev/null
+++ b/src/c++/perf_analyzer/docs/input_data.md
@@ -0,0 +1,305 @@

# Input Data

Use the [`--help`](cli.md#--help) option to see complete documentation for all input data options. By default Perf Analyzer sends random data to all the inputs of your model. You can select a different input data mode with the [`--input-data`](cli.md#--input-datazerorandompath) option:

- _random_: (default) Send random data for each input. Note: Perf Analyzer only generates random data once per input and reuses that for all inferences.
- _zero_: Send zeros for each input.
- directory path: A path to a directory containing a binary file for each input, named the same as the input. Each binary file must contain the data required for that input for a batch-1 request. Each file should contain the raw binary representation of the input in row-major order.
- file path: A path to a JSON file containing data to be used with every inference request. See the "Real Input Data" section for further details. [`--input-data`](cli.md#--input-datazerorandompath) can be provided multiple times with different file paths to specify multiple JSON files.

For tensors with `STRING`/`BYTES` datatype, the [`--string-length`](cli.md#--string-lengthn) and [`--string-data`](cli.md#--string-datastring) options may be used in some cases (see [`--help`](cli.md#--help) for full documentation).
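For example, a minimal sketch of how these string options might be used with a hypothetical model that has a `STRING`/`BYTES` input:

```bash
# generate random strings of length 16 for each string input element
perf_analyzer -m my_text_model --string-length 16

# or replicate a fixed string for every string input element instead
perf_analyzer -m my_text_model --string-data "hello world"
```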
+ +For models that support batching you can use the [`-b`](cli.md#-b-n) option to +indicate the batch size of the requests that Perf Analyzer should send. For +models with variable-sized inputs you must provide the +[`--shape`](cli.md#--shapestring) argument so that Perf Analyzer knows what +shape tensors to use. For example, for a model that has an input called +`IMAGE` that has shape `[3, N, M]`, where `N` and `M` are variable-size +dimensions, to tell Perf Analyzer to send batch size 4 requests of shape +`[3, 224, 224]`: + +``` +$ perf_analyzer -m mymodel -b 4 --shape IMAGE:3,224,224 +``` + +## Real Input Data + +The performance of some models is highly dependent on the data used. For such +cases you can provide data to be used with every inference request made by Perf +Analyzer in a JSON file. Perf Analyzer will use the provided data in a +round-robin order when sending inference requests. For sequence models, if a +sequence length is specified via +[`--sequence-length`](cli.md#--sequence-lengthn), Perf Analyzer will also loop +through the provided data in a round-robin order up to the specified sequence +length (with a percentage variation customizable via +[`--sequence-length-variation`](cli.md#--sequence-length-variationn)). +Otherwise, the sequence length will be the number of inputs specified in +user-provided input data. + +Each entry in the `"data"` array must specify all input tensors with the exact +size expected by the model for a single batch. The following example describes +data for a model with inputs named, `INPUT0` and `INPUT1`, shape `[4, 4]` and +data type `INT32`: + +```json +{ + "data": + [ + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + }, + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + }, + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + }, + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + } + ] +} +``` + +Note that the `[4, 4]` tensor has been flattened in a row-major format for the +inputs. In addition to specifying explicit tensors, you can also provide Base64 +encoded binary data for the tensors. Each data object must list its data in a +row-major order. Binary data must be in little-endian byte order. The following +example highlights how this can be acheived: + +```json +{ + "data": + [ + { + "INPUT0": {"b64": "YmFzZTY0IGRlY29kZXI="}, + "INPUT1": {"b64": "YmFzZTY0IGRlY29kZXI="} + }, + { + "INPUT0": {"b64": "YmFzZTY0IGRlY29kZXI="}, + "INPUT1": {"b64": "YmFzZTY0IGRlY29kZXI="} + }, + { + "INPUT0": {"b64": "YmFzZTY0IGRlY29kZXI="}, + "INPUT1": {"b64": "YmFzZTY0IGRlY29kZXI="} + } + ] +} +``` + +In case of sequence models, multiple data streams can be specified in the JSON +file. Each sequence will get a data stream of its own and Perf Analyzer will +ensure the data from each stream is played back to the same correlation ID. 
The +below example highlights how to specify data for multiple streams for a sequence +model with a single input named `INPUT`, shape `[1]` and data type `STRING`: + +```json +{ + "data": + [ + [ + { + "INPUT": ["1"] + }, + { + "INPUT": ["2"] + }, + { + "INPUT": ["3"] + }, + { + "INPUT": ["4"] + } + ], + [ + { + "INPUT": ["1"] + }, + { + "INPUT": ["1"] + }, + { + "INPUT": ["1"] + } + ], + [ + { + "INPUT": ["1"] + }, + { + "INPUT": ["1"] + } + ] + ] +} +``` + +The above example describes three data streams with lengths 4, 3 and 2 +respectively. Perf Analyzer will hence produce sequences of length 4, 3 and 2 in +this case. + +You can also provide an optional `"shape"` field to the tensors. This is +especially useful while profiling the models with variable-sized tensors as +input. Additionally note that when providing the `"shape"` field, tensor +contents must be provided separately in a "content" field in row-major order. +The specified shape values will override default input shapes provided as a +command line option (see [`--shape`](cli.md#--shapestring)) for variable-sized +inputs. In the absence of a `"shape"` field, the provided defaults will be used. +There is no need to specify shape as a command line option if all the input data +provide shape values for variable tensors. Below is an example JSON file for a +model with a single input `INPUT`, shape `[-1, -1]` and data type `INT32`: + +```json +{ + "data": + [ + { + "INPUT": + { + "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "shape": [2,8] + } + }, + { + "INPUT": + { + "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "shape": [8,2] + } + }, + { + "INPUT": + { + "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + } + }, + { + "INPUT": + { + "content": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "shape": [4,4] + } + } + ] +} +``` + +The following is the example to provide contents as base64 string with explicit +shapes: + +```json +{ + "data": + [ + { + "INPUT": + { + "content": {"b64": "/9j/4AAQSkZ(...)"}, + "shape": [7964] + } + }, + { + "INPUT": + { + "content": {"b64": "/9j/4AAQSkZ(...)"}, + "shape": [7964] + } + } + ] +} +``` + +Note that for `STRING` type, an element is represented by a 4-byte unsigned +integer giving the length followed by the actual bytes. The byte array to be +encoded using base64 must include the 4-byte unsigned integers. + +### Output Validation + +When real input data is provided, it is optional to request Perf Analyzer to +validate the inference output for the input data. + +Validation output can be specified in the `"validation_data"` field have the +same format as the `"data"` field for real input. Note that the entries in +`"validation_data"` must align with `"data"` for proper mapping. The following +example describes validation data for a model with inputs named `INPUT0` and +`INPUT1`, outputs named `OUTPUT0` and `OUTPUT1`, all tensors have shape `[4, 4]` +and data type `INT32`: + +```json +{ + "data": + [ + { + "INPUT0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + "INPUT1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] + } + ], + "validation_data": + [ + { + "OUTPUT0": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + "OUTPUT1": [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2] + } + ] +} +``` + +Besides the above example, the validation outputs can be specified in the same +variations described in the real input data section. 
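As a sketch, assuming the JSON above were saved to a hypothetical file named `data_with_validation.json`, it is supplied like any other real input data file, and Perf Analyzer will compare the received outputs against the listed validation values:

```bash
# use the user-provided data (and its validation outputs) for every request
perf_analyzer -m my_model --input-data data_with_validation.json
```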
+ +# Shared Memory + +By default Perf Analyzer sends input tensor data and receives output tensor data +over the network. You can instead instruct Perf Analyzer to use system shared +memory or CUDA shared memory to communicate tensor data. By using these options +you can model the performance that you can achieve by using shared memory in +your application. Use +[`--shared-memory=system`](cli.md#--shared-memorynonesystemcuda) to use system +(CPU) shared memory or +[`--shared-memory=cuda`](cli.md#--shared-memorynonesystemcuda) to use CUDA +shared memory. diff --git a/src/c++/perf_analyzer/docs/install.md b/src/c++/perf_analyzer/docs/install.md new file mode 100644 index 000000000..b5d84a62a --- /dev/null +++ b/src/c++/perf_analyzer/docs/install.md @@ -0,0 +1,106 @@ + + +# Recommended Installation Method + +## Triton SDK Container + +The recommended way to "install" Perf Analyzer is to run the pre-built +executable from within the Triton SDK docker container available on the +[NVIDIA GPU Cloud Catalog](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver). +As long as the SDK container has its network exposed to the address and port of +the inference server, Perf Analyzer will be able to run. + +```bash +export RELEASE= # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02` + +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +# inside container +perf_analyzer -m +``` + +# Alternative Installation Methods + +- [Pip](#pip) +- [Build from Source](#build-from-source) + +## Pip + +```bash +pip install tritonclient + +perf_analyzer -m +``` + +**Warning**: If any runtime dependencies are missing, Perf Analyzer will produce +errors showing which ones are missing. You will need to manually install them. + +## Build from Source + +The Triton SDK container is used for building, so some build and runtime +dependencies are already installed. + +```bash +export RELEASE= # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02` + +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +# inside container +# prep installing newer version of cmake +wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null ; apt-add-repository 'deb https://apt.kitware.com/ubuntu/ focal main' + +# install build/runtime dependencies +apt update ; apt install -y cmake-data=3.21.1-0kitware1ubuntu20.04.1 cmake=3.21.1-0kitware1ubuntu20.04.1 libcurl4-openssl-dev rapidjson-dev + +rm -rf client ; git clone --depth 1 https://github.com/triton-inference-server/client + +mkdir client/build ; cd client/build + +cmake -DTRITON_ENABLE_PERF_ANALYZER=ON .. + +make -j8 cc-clients + +perf_analyzer -m +``` + +- To enable + [CUDA shared memory](input_data.md#shared-memory), add + `-DTRITON_ENABLE_GPU=ON` to the `cmake` command. +- To enable + [C API mode](benchmarking.md#benchmarking-triton-directly-via-c-api), add + `-DTRITON_ENABLE_PERF_ANALYZER_C_API=ON` to the `cmake` command. +- To enable [TorchServe backend](benchmarking.md#benchmarking-torchserve), add + `-DTRITON_ENABLE_PERF_ANALYZER_TS=ON` to the `cmake` command. +- To enable + [Tensorflow Serving backend](benchmarking.md#benchmarking-tensorflow-serving), + add `-DTRITON_ENABLE_PERF_ANALYZER_TFS=ON` to the `cmake` command. 
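For example, combining the flags listed above, a build that also enables CUDA shared memory and C API mode could be configured as follows (a sketch; run in place of the plain `cmake` invocation above, from the same `client/build` directory):

```bash
# enable optional features on top of the base Perf Analyzer build
cmake -DTRITON_ENABLE_PERF_ANALYZER=ON \
      -DTRITON_ENABLE_PERF_ANALYZER_C_API=ON \
      -DTRITON_ENABLE_GPU=ON ..

make -j8 cc-clients
```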
diff --git a/src/c++/perf_analyzer/docs/measurements_metrics.md b/src/c++/perf_analyzer/docs/measurements_metrics.md new file mode 100644 index 000000000..dd6b1ee72 --- /dev/null +++ b/src/c++/perf_analyzer/docs/measurements_metrics.md @@ -0,0 +1,224 @@ + + +# Measurement Modes + +Currently, Perf Analyzer has 2 measurement modes. + +## Time Windows + +When using time windows measurement mode +([`--measurement-mode=time_windows`](cli.md#--measurement-modetime_windowscount_windows)), +Perf Analyzer will count how many requests have completed during a window of +duration `X` (in milliseconds, via `--measurement-interval=X`, default is +`5000`). This is the default measurement mode. + +## Count Windows + +When using count windows measurement mode +([`--measurement-mode=count_windows`](cli.md#--measurement-modetime_windowscount_windows)), +Perf Analyzer will start the window duration at 1 second and potentially +dynamically increase it until `X` requests have completed (via +[`--measurement-request-count=X`](cli.md#--measurement-request-countn), default +is `50`). + +# Metrics + +## How Throughput is Calculated + +Perf Analyzer calculates throughput to be the total number of requests completed +during a measurement, divided by the duration of the measurement, in seconds. + +## How Latency is Calculated + +For each request concurrency level Perf Analyzer reports latency and throughput +as seen from Perf Analyzer and also the average request latency on the server. + +The server latency measures the total time from when the request is received at +the server until when the response is sent from the server. Because of the HTTP +and gRPC libraries used to implement the server endpoints, total server latency +is typically more accurate for HTTP requests as it measures time from the first +byte received until last byte sent. For both HTTP and gRPC the total server +latency is broken-down into the following components: + +- _queue_: The average time spent in the inference schedule queue by a request + waiting for an instance of the model to become available. +- _compute_: The average time spent performing the actual inference, including + any time needed to copy data to/from the GPU. +- _overhead_: The average time spent in the endpoint that cannot be correctly + captured in the send/receive time with the way the gRPC and HTTP libraries are + structured. + +The client latency time is broken-down further for HTTP and gRPC as follows: + +- HTTP: _send/recv_ indicates the time on the client spent sending the request + and receiving the response. _response wait_ indicates time waiting for the + response from the server. +- gRPC: _(un)marshal request/response_ indicates the time spent marshalling the + request data into the gRPC protobuf and unmarshalling the response data from + the gRPC protobuf. _response wait_ indicates time writing the gRPC request to + the network, waiting for the response, and reading the gRPC response from the + network. + +Use the verbose ([`-v`](cli.md#-v)) option see more output, including the +stabilization passes run for each request concurrency level or request rate. + +# Reports + +## Visualizing Latency vs. Throughput + +Perf Analyzer provides the [`-f`](cli.md#-f-path) option to generate a file +containing CSV output of the results. + +``` +$ perf_analyzer -m inception_graphdef --concurrency-range 1:4 -f perf.csv +... 
+$ cat perf.csv +Concurrency,Inferences/Second,Client Send,Network+Server Send/Recv,Server Queue,Server Compute Input,Server Compute Infer,Server Compute Output,Client Recv,p50 latency,p90 latency,p95 latency,p99 latency +1,69.2,225,2148,64,206,11781,19,0,13891,18795,19753,21018 +3,84.2,237,1768,21673,209,11742,17,0,35398,43984,47085,51701 +4,84.2,279,1604,33669,233,11731,18,1,47045,56545,59225,64886 +2,87.2,235,1973,9151,190,11346,17,0,21874,28557,29768,34766 +``` + +NOTE: The rows in the CSV file are sorted in an increasing order of throughput +(Inferences/Second). + +You can import the CSV file into a spreadsheet to help visualize the latency vs +inferences/second tradeoff as well as see some components of the latency. Follow +these steps: + +- Open + [this spreadsheet](https://docs.google.com/spreadsheets/d/1S8h0bWBBElHUoLd2SOvQPzZzRiQ55xjyqodm_9ireiw) +- Make a copy from the File menu "Make a copy..." +- Open the copy +- Select the A1 cell on the "Raw Data" tab +- From the File menu select "Import..." +- Select "Upload" and upload the file +- Select "Replace data at selected cell" and then select the "Import data" + button + +## Server-side Prometheus metrics + +Perf Analyzer can collect +[server-side metrics](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md#gpu-metrics), +such as GPU utilization and GPU power usage. To enable the collection of these +metrics, use the [`--collect-metrics`](cli.md#--collect-metrics) option. + +By default, Perf Analyzer queries the metrics endpoint at the URL +`localhost:8002/metrics`. If the metrics are accessible at a different url, use +the [`--metrics-url=`](cli.md#--metrics-urlurl) option to specify that. + +By default, Perf Analyzer queries the metrics endpoint every 1000 milliseconds. +To use a different querying interval, use the +[`--metrics-interval=`](cli.md#--metrics-intervaln) option (specify in +milliseconds). + +Because Perf Analyzer can collect the server-side metrics multiple times per +run, these metrics are aggregated in specific ways to produce one final number +per searched concurrency or request rate. Here are how the metrics are +aggregated: + +| Metric | Aggregation | +| - | - | +| GPU Utilization | Averaged from each collection taken during stable passes. We want a number representative of all stable passes. | +| GPU Power Usage | Averaged from each collection taken during stable passes. We want a number representative of all stable passes. | +| GPU Used Memory | Maximum from all collections taken during a stable pass. Users are typically curious what the peak memory usage is for determining model/hardware viability. | +| GPU Total Memory | First from any collection taken during a stable pass. All of the collections should produce the same value for total memory available on the GPU. | + +Note that all metrics are per-GPU in the case of multi-GPU systems. + +To output these server-side metrics to a CSV file, use the +[`-f `](cli.md#-f-path) and [`--verbose-csv`](cli.md#--verbose-csv) +options. The output CSV will contain one column per metric. The value of each +column will be a `key:value` pair (`GPU UUID:metric value`). Each `key:value` +pair will be delimited by a semicolon (`;`) to indicate metric values for each +GPU accessible by the server. There is a trailing semicolon. 
See below:

`<gpu uuid>:<metric value>;<gpu uuid>:<metric value>;...;`

Here is a simplified CSV output:

```
$ perf_analyzer -m resnet50_libtorch --collect-metrics -f output.csv --verbose-csv
$ cat output.csv
Concurrency,...,Avg GPU Utilization,Avg GPU Power Usage,Max GPU Memory Usage,Total GPU Memory
1,...,gpu_uuid_0:0.33;gpu_uuid_1:0.5;,gpu_uuid_0:55.3;gpu_uuid_1:56.9;,gpu_uuid_0:10000;gpu_uuid_1:11000;,gpu_uuid_0:50000;gpu_uuid_1:75000;,
2,...,gpu_uuid_0:0.25;gpu_uuid_1:0.6;,gpu_uuid_0:25.6;gpu_uuid_1:77.2;,gpu_uuid_0:11000;gpu_uuid_1:17000;,gpu_uuid_0:50000;gpu_uuid_1:75000;,
3,...,gpu_uuid_0:0.87;gpu_uuid_1:0.9;,gpu_uuid_0:87.1;gpu_uuid_1:71.7;,gpu_uuid_0:15000;gpu_uuid_1:22000;,gpu_uuid_0:50000;gpu_uuid_1:75000;,
```

## Communication Protocol

By default, Perf Analyzer uses HTTP to communicate with Triton. The gRPC protocol can be specified with the [`-i [http|grpc]`](cli.md#-i-httpgrpc) option. If gRPC is selected, the [`--streaming`](cli.md#--streaming) option can also be specified for gRPC streaming.

### SSL/TLS Support

Perf Analyzer can be used to benchmark a Triton service behind SSL/TLS-enabled endpoints. The following options help establish a secure connection with the endpoint and profile the server.

For gRPC, see the following options:

- [`--ssl-grpc-use-ssl`](cli.md#--ssl-grpc-use-ssl)
- [`--ssl-grpc-root-certifications-file=<path>`](cli.md#--ssl-grpc-root-certifications-filepath)
- [`--ssl-grpc-private-key-file=<path>`](cli.md#--ssl-grpc-private-key-filepath)
- [`--ssl-grpc-certificate-chain-file=<path>`](cli.md#--ssl-grpc-certificate-chain-filepath)

More details here:
https://grpc.github.io/grpc/cpp/structgrpc_1_1_ssl_credentials_options.html

The [inference protocol gRPC SSL/TLS section](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md#ssltls) describes server-side options to configure SSL/TLS in Triton's gRPC endpoint.

For HTTPS, the following options are exposed:

- [`--ssl-https-verify-peer`](cli.md#--ssl-https-verify-peer01)
- [`--ssl-https-verify-host`](cli.md#--ssl-https-verify-host012)
- [`--ssl-https-ca-certificates-file`](cli.md#--ssl-https-ca-certificates-filepath)
- [`--ssl-https-client-certificate-file`](cli.md#--ssl-https-client-certificate-filepath)
- [`--ssl-https-client-certificate-type`](cli.md#--ssl-https-client-certificate-typepemder)
- [`--ssl-https-private-key-file`](cli.md#--ssl-https-private-key-filepath)
- [`--ssl-https-private-key-type`](cli.md#--ssl-https-private-key-typepemder)

See [`--help`](cli.md#--help) for full documentation.

Unlike gRPC, Triton's HTTP server endpoint cannot be configured with SSL/TLS support.

Note: Just providing these `--ssl-https-*` options to Perf Analyzer does not ensure that SSL/TLS is used in communication. If SSL/TLS is not enabled on the service endpoint, these options have no effect. The intent of exposing these options is to allow users to configure Perf Analyzer to benchmark a Triton service behind SSL/TLS-enabled endpoints. In other words, if Triton is running behind an HTTPS server proxy, then these options allow Perf Analyzer to profile Triton via the exposed HTTPS proxy.

diff --git a/src/c++/perf_analyzer/docs/quick_start.md b/src/c++/perf_analyzer/docs/quick_start.md
new file mode 100644
index 000000000..cfcc2b3d1
--- /dev/null
+++ b/src/c++/perf_analyzer/docs/quick_start.md
@@ -0,0 +1,114 @@

# Quick Start

The steps below will guide you on how to start using Perf Analyzer.
+ +### Step 1: Start Triton Container + +```bash +export RELEASE= # e.g. to use the release from the end of February of 2023, do `export RELEASE=23.02` + +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3 + +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3 +``` + +### Step 2: Download `simple` Model + +```bash +# inside triton container +git clone --depth 1 https://github.com/triton-inference-server/server + +mkdir model_repository ; cp -r server/docs/examples/model_repository/simple model_repository +``` + +### Step 3: Start Triton Server + +```bash +# inside triton container +tritonserver --model-repository $(pwd)/model_repository &> server.log & + +# confirm server is ready, look for 'HTTP/1.1 200 OK' +curl -v localhost:8000/v2/health/ready + +# detatch (CTRL-p CTRL-q) +``` + +### Step 4: Start Triton SDK Container + +```bash +docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk + +docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk +``` + +### Step 5: Run Perf Analyzer + +```bash +# inside sdk container +perf_analyzer -m simple +``` + +### Step 6: Observe and Analyze Output + +``` +$ perf_analyzer -m simple +*** Measurement Settings *** + Batch size: 1 + Service Kind: Triton + Using "time_windows" mode for stabilization + Measurement window: 5000 msec + Using synchronous calls for inference + Stabilizing using average latency + +Request concurrency: 1 + Client: + Request count: 25348 + Throughput: 1407.84 infer/sec + Avg latency: 708 usec (standard deviation 663 usec) + p50 latency: 690 usec + p90 latency: 881 usec + p95 latency: 926 usec + p99 latency: 1031 usec + Avg HTTP time: 700 usec (send/recv 102 usec + response wait 598 usec) + Server: + Inference count: 25348 + Execution count: 25348 + Successful request count: 25348 + Avg request latency: 382 usec (overhead 41 usec + queue 41 usec + compute input 26 usec + compute infer 257 usec + compute output 16 usec) + +Inferences/Second vs. Client Average Batch Latency +Concurrency: 1, throughput: 1407.84 infer/sec, latency 708 usec +``` + +We can see from the output that the model was able to complete approximately +1407.84 inferences per second, with an average latency of 708 microseconds per +inference request. Concurrency of 1 meant that Perf Analyzer attempted to always +have 1 outgoing request at all times.
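From here, a natural next experiment is to sweep the load instead of using the default concurrency of 1. For example, using the `--concurrency-range` option described in the CLI documentation:

```bash
# inside sdk container: measure the simple model at concurrencies 1 through 4
perf_analyzer -m simple --concurrency-range 1:4
```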