Add quick start documentation page (#261)
* Add quick start documentation page

* Addressed comments

* Addressed comments

* Fixed typo

* Addressed comments
matthewkotila committed Mar 16, 2023
1 parent ca9d47a commit 1b08e59
Showing 1 changed file with 91 additions and 0 deletions.
91 changes: 91 additions & 0 deletions src/c++/perf_analyzer/README.md
@@ -73,6 +73,97 @@ Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 69.6 infer/sec, latency 19673 usec
```

## Quick Start

The steps below will guide you through using Perf Analyzer to profile a simple
TensorFlow model: `simple`.

### Step 1: Start Triton Container

```bash
export RELEASE=<yy.mm> # e.g. to use the release from February 2023, run `export RELEASE=23.02`

docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3

docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3
```
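If you want to confirm the image was pulled before starting the container, an optional check (not part of the original steps) is:

```bash
# optional: list locally available Triton server images and their tags
docker images nvcr.io/nvidia/tritonserver
```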

### Step 2: Download `simple` Model

```bash
# inside triton container
git clone --depth 1 https://github.com/triton-inference-server/server

mkdir model_repository

cp -r server/docs/examples/model_repository/simple model_repository
```
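To verify the copy, you can optionally list the repository contents. The exact files vary by release, but you should see a `config.pbtxt` and at least one numbered version directory under `simple`:

```bash
# inside triton container
# optional sanity check of the model repository layout
ls -R model_repository
```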

### Step 3: Start Triton Server

```bash
# inside triton container
tritonserver --model-repository $(pwd)/model_repository &> server.log &

# confirm server is ready
curl -v localhost:8000/v2/health/ready
# look for 'HTTP/1.1 200 OK'

# detach (CTRL-p CTRL-q)
```
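In addition to the server-level health check above, Triton's HTTP API exposes per-model readiness. You can optionally confirm that the `simple` model itself is loaded and ready (run this before detaching, or from the host, since the container uses host networking):

```bash
# optional: confirm the `simple` model is loaded and ready to serve requests
curl -v localhost:8000/v2/models/simple/ready
# look for 'HTTP/1.1 200 OK'
```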

### Step 4: Start Triton SDK Container

```bash
docker pull nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk

docker run --gpus all --rm -it --net host nvcr.io/nvidia/tritonserver:${RELEASE}-py3-sdk
```
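Once inside the SDK container, you can optionally confirm that Perf Analyzer is available and view its options:

```bash
# inside sdk container
# optional: confirm perf_analyzer is on PATH and list its options
perf_analyzer --help
```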

### Step 5: Run Perf Analyzer

```bash
# inside sdk container
perf_analyzer -m simple
```
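The command above uses the defaults (HTTP, synchronous requests, stabilizing on average latency). If you want to experiment, two optional variations (a sketch assuming the standard `-i` and `--percentile` flags) are:

```bash
# inside sdk container
# use gRPC instead of HTTP
perf_analyzer -m simple -i grpc

# stabilize and report on p95 latency instead of average latency
perf_analyzer -m simple --percentile=95
```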

### Step 6: Observe and Analyze Output

```
$ perf_analyzer -m simple
*** Measurement Settings ***
Batch size: 1
Service Kind: Triton
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 25348
Throughput: 1407.84 infer/sec
Avg latency: 708 usec (standard deviation 663 usec)
p50 latency: 690 usec
p90 latency: 881 usec
p95 latency: 926 usec
p99 latency: 1031 usec
Avg HTTP time: 700 usec (send/recv 102 usec + response wait 598 usec)
Server:
Inference count: 25348
Execution count: 25348
Successful request count: 25348
Avg request latency: 382 usec (overhead 41 usec + queue 41 usec + compute input 26 usec + compute infer 257 usec + compute output 16 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1407.84 infer/sec, latency 708 usec
```

We can see from the output that the model sustained approximately 1407.84
inferences per second, with an average latency of 708 microseconds per
inference request. A concurrency of 1 means that Perf Analyzer tried to keep
exactly one request outstanding at all times.
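To see how these numbers change as more requests are kept in flight, you can sweep the concurrency level; a minimal sketch using the `--concurrency-range` flag, which the next section discusses in more detail, is:

```bash
# inside sdk container
# sweep request concurrency from 1 to 4, one level at a time
perf_analyzer -m simple --concurrency-range 1:4
```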

## Request Concurrency

By default perf_analyzer measures your model's latency and throughput
