> Because of a known issue with C++11 ABI compatibility within the NGC PyTorch container,
> we rebuild TensorRT-LLM from source. See [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html)
> for more information.
>
> As a result, the first run of this script can take quite a long time.
### Run container
To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
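As a rough sketch of what such an invocation can look like (the `genai-perf` flag names below are assumptions and may differ between GenAI-Perf versions; `perf.sh` in this repo is the authoritative reference), the command is assembled from your deployment's model name and host:

```bash
# Hypothetical GenAI-Perf invocation; MODEL and HOST are placeholders that you
# must replace with the values from your deployment, as perf.sh does.
MODEL="my-model"
HOST="http://localhost:8000"

# Build the command as an array so the placeholders are quoted safely.
CMD=(genai-perf profile -m "$MODEL" --url "$HOST")
echo "Would run: ${CMD[*]}"
# Execute with: "${CMD[@]}"
```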
### Disaggregated Serving with KV Cache Transfer using **NIXL** (EXPERIMENTAL)
In disaggregated serving architectures, KV cache must be transferred between prefill and decode nodes. TensorRT-LLM supports two methods for this transfer:
#### Default Method: UCX
By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode nodes. UCX provides high-performance communication optimized for GPU-to-GPU transfers.
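To confirm which UCX build is present inside the container, one option (assuming the standard `ucx_info` utility that ships with UCX; it may not be on `PATH` in every image) is:

```bash
# Print the UCX version if the ucx_info utility is available; otherwise note
# that UCX may still be linked into TensorRT-LLM without the CLI tools.
if command -v ucx_info >/dev/null 2>&1; then
  ucx_info -v
else
  echo "ucx_info not found; UCX may still be linked into TensorRT-LLM directly"
fi
```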
#### Experimental Method: NIXL
TensorRT-LLM also provides experimental support for **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. NIXL is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
**Note:** NIXL support is experimental and not suitable for production environments.
#### Using NIXL for KV Cache Transfer
To enable NIXL for KV cache transfer in disaggregated serving:
1. **Build the container with NIXL support:**

   The TensorRT-LLM wheel must be built from source with NIXL support. The `./container/build.sh` script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support.

   **Remove the cached TensorRT-LLM wheel (only if it was previously built without NIXL support):**

   ```bash
   rm -rf /tmp/trtllm_wheel
   ```

   **Build the container with NIXL support:**

   ```bash
   ./container/build.sh --framework tensorrtllm \
     --use-default-experimental-tensorrtllm-commit \
     --trtllm-use-nixl-kvcache-experimental
   ```

   **Note:** Both the `--use-default-experimental-tensorrtllm-commit` and `--trtllm-use-nixl-kvcache-experimental` flags are required to enable NIXL support.
2. **Run the containerized environment:**

   See the [run container](#run-container) section to learn how to start the container image built in the previous step.
3. **Start the disaggregated service:**

   See the [disaggregated serving](#disaggregated-serving) section to learn how to start the deployment.
4. **Send the request:**

   See the [client](#client) section to learn how to send a request to the deployment.
**Important:** Ensure that ETCD and NATS services are running before starting the service.
The container automatically sets the appropriate environment variable (`TRTLLM_USE_NIXL_KVCACHE=1`) when built with the NIXL flag. The same container image can still be used with UCX for KV cache transfer.
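Inside a running container you can sanity-check which transfer backend is configured by inspecting that variable; a minimal sketch:

```bash
# Report the KV-cache transfer backend based on TRTLLM_USE_NIXL_KVCACHE,
# which the NIXL-enabled container build sets to 1 (unset/0 means UCX).
if [ "${TRTLLM_USE_NIXL_KVCACHE:-0}" = "1" ]; then
  echo "KV cache transfer backend: NIXL (experimental)"
else
  echo "KV cache transfer backend: UCX (default)"
fi
```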