docs/source/contributing/profiling/profiling_index.md
:::{important}
Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
:::
## Profile with PyTorch Profiler
We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/`
The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set.
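For example, a server launch with profiling enabled might look like this (the model name is illustrative):

```bash
# Traces will be written to /mnt/traces/ when profiling is triggered.
VLLM_TORCH_PROFILER_DIR=/mnt/traces/ vllm serve meta-llama/Llama-3.1-8B-Instruct
```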
:::{tip}
Set the env variable `VLLM_RPC_TIMEOUT` to a big number before you start the server, e.g. half an hour:

`export VLLM_RPC_TIMEOUT=1800000`
:::
### Example commands and usage
#### Offline Inference
Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example.
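A minimal sketch along the same lines (the model and prompts are illustrative, and the env var must be set before the engine is created):

```python
import os

# vLLM reads this at engine construction time; traces are written here.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # illustrative small model

# Only the region between start_profile() and stop_profile() is traced.
llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
llm.stop_profile()

for output in outputs:
    print(output.outputs[0].text)
```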
## Profile with NVIDIA Nsight Systems

Nsight Systems is an advanced tool that exposes more profiling details, such as register and shared-memory usage, annotated code regions, and low-level CUDA APIs and events.
[Install nsight-systems](https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html) using your package manager.
The following block is an example for Ubuntu.
```bash
apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
# Add NVIDIA's signing key per the installation guide linked above, then:
apt update
apt install -y nsight-systems-cli
```
For basic usage, you can simply prepend `nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node` to any existing script you would run for offline inference.
The following is an example using the `benchmarks/benchmark_latency.py` script (the model and benchmark flags shown below are illustrative):
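```bash
# Sketch only; adjust the model and benchmark flags to your setup.
nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node \
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-iters-warmup 5 --num-iters 1 \
    --batch-size 16 --input-len 512 --output-len 8
```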
#### OpenAI Server

To profile the server, prepend your `vllm serve` command with `nsys profile`, just as for offline inference; however, you must specify the `--delay XX --duration YY` parameters according to the needs of your benchmark. Once the duration has elapsed, the server will be killed.
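For example (the delay/duration values and model are illustrative):

```bash
# Wait 30 s for startup, then trace the next 60 s of serving.
nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node \
  --delay 30 --duration 60 \
  vllm serve meta-llama/Llama-3.1-8B-Instruct
```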
In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
```bash
nsys sessions list
```
to get the session id in the form of `profile-XXXXX`, then run:
```bash
nsys stop --session=profile-XXXXX
```
to manually kill the profiler and generate your `nsys-rep` report.
#### Analysis
You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight locally, [following the directions here](https://developer.nvidia.com/nsight-systems/get-started).
CLI example:
```bash
nsys stats report1.nsys-rep
...
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
```