# feat: Using NIXL for KV cache transfer when using disaggregated serving in TRTLLM (#1591)
base: main
Changes from all commits: 724341b, 057f01d, 42c2fd2, 378c945, cca5921, b17e7d8, 9807177, 772014e, 76f5835, cb91029
New file: NIXL install script (`@@ -0,0 +1,80 @@`):

```bash
#!/bin/bash -e
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Install NIXL for TensorRT-LLM.
# This script is an adapted version of the NIXL install script from the TensorRT-LLM repository.
# The original script is located at:
# https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/common/install_nixl.sh

set -ex

GITHUB_URL="https://github.com"

UCX_VERSION="v1.18.1"
UCX_INSTALL_PATH="/usr/local/ucx/"
CUDA_PATH="/usr/local/cuda"

NIXL_COMMIT="16348080f5bdeb9fe6058a23be140cec020ef3f3"

UCX_REPO="https://github.com/openucx/ucx.git"
NIXL_REPO="https://github.com/ai-dynamo/nixl.git"

# Build and install UCX from source if it is not already present.
if [ ! -d ${UCX_INSTALL_PATH} ]; then
    git clone --depth 1 -b ${UCX_VERSION} ${UCX_REPO}
    cd ucx
    ./autogen.sh
    ./contrib/configure-release \
        --prefix=${UCX_INSTALL_PATH} \
        --enable-shared \
        --disable-static \
        --disable-doxygen-doc \
        --enable-optimizations \
        --enable-cma \
        --enable-devel-headers \
        --with-cuda=${CUDA_PATH} \
        --with-verbs \
        --with-dm \
        --enable-mt
    make install -j$(nproc)
    cd ..
    rm -rf ucx # Remove UCX source to save space
    echo "export LD_LIBRARY_PATH=${UCX_INSTALL_PATH}/lib:\$LD_LIBRARY_PATH" >> "${ENV}"
fi

# Detect the platform; the GDS backend is disabled on non-x86_64 builds.
ARCH_NAME="x86_64-linux-gnu"
if [ "$(uname -m)" != "amd64" ] && [ "$(uname -m)" != "x86_64" ]; then
    ARCH_NAME="aarch64-linux-gnu"
    EXTRA_NIXL_ARGS="-Ddisable_gds_backend=true"
fi

if [ "$ARCH_NAME" != "x86_64-linux-gnu" ]; then
    echo "The NIXL backend is temporarily unavailable on the aarch64 platform. Exiting script."
    exit 0
fi

# Build NIXL from source at a pinned commit, linking against the UCX installed above.
pip3 install --no-cache-dir meson ninja pybind11
git clone ${NIXL_REPO} nixl
cd nixl
git checkout ${NIXL_COMMIT}
meson setup builddir -Ducx_path=${UCX_INSTALL_PATH} -Dstatic_plugins=UCX -Dbuildtype=release ${EXTRA_NIXL_ARGS}
cd builddir && ninja install
cd ../..
rm -rf nixl* # Remove NIXL source tree to save space

echo "export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/${ARCH_NAME}:/opt/nvidia/nvda_nixl/lib64:\$LD_LIBRARY_PATH" >> "${ENV}"
```
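The script appends `LD_LIBRARY_PATH` exports to the file named by the `ENV` variable, so the caller is expected to point `ENV` at a shell profile that later shells source. A minimal sketch of how it might be invoked during a container build, assuming the script is saved as `install_nixl.sh`; the profile path is a hypothetical choice, not something this PR specifies:

```bash
# Hypothetical invocation: ENV must name a file that later shells source,
# because the script appends its LD_LIBRARY_PATH exports there.
export ENV=/etc/profile.d/nixl.sh
touch "${ENV}"
./install_nixl.sh
source "${ENV}" # pick up the UCX/NIXL library paths in the current shell
```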
README changes:

````diff
@@ -69,15 +69,6 @@ apt-get update && apt-get -y install git git-lfs
 ./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
 ```
-
-> [!NOTE]
-> Because of a known issue of C++11 ABI compatibility within the NGC pytorch container,
-> we rebuild TensorRT-LLM from source. See [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html)
-> for more information.
->
-> Hence, when running this script for the first time, the time taken by this script can be
-> quite long.
-

 ### Run container

 ```
````
@@ -306,13 +297,54 @@ See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section

To benchmark your deployment with GenAI-Perf, see this utility script, configuring the `model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)

### Future Work

Remaining tasks:
- [x] Add support for disaggregated serving.
- [x] Add multi-node support.
- [x] Add instructions for benchmarking.
- [x] Use the processor from the dynamo-llm framework.
- [ ] Add integration test coverage.
- [ ] Merge the code base with the LLM example to reduce code duplication.
- [ ] Enable NIXL integration with TensorRT-LLM once available. Currently, TensorRT-LLM uses UCX to transfer the KV cache.

### KV Cache Transfer for Disaggregated Serving

In disaggregated serving architectures, the KV cache must be transferred between prefill and decode nodes. TensorRT-LLM supports two methods for this transfer:

#### Default Method: UCX

By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode nodes. UCX provides high-performance communication optimized for GPU-to-GPU transfers.
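For reference, the default UCX path needs no extra build flags; the standard build shown earlier in this README is sufficient (repeated here as a sketch):

```bash
# Builds the container with the default KV cache transfer backend (UCX).
./container/build.sh --framework tensorrtllm --use-default-experimental-tensorrtllm-commit
```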
#### Experimental Method: NIXL

TensorRT-LLM also provides experimental support for using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.

**Note:** NIXL support in TensorRT-LLM is experimental and not yet suitable for production environments.
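Because the backend choice is driven entirely by environment variables (shown later in this section), here is a sketch of how one might confirm which transfer path a running container is configured for; this is an illustrative check, not part of the PR:

```bash
# Hypothetical sanity check: report which KV cache transfer backend
# the TensorRT-LLM workers will select, based on the environment.
if [ "${TRTLLM_USE_NIXL_KVCACHE:-0}" = "1" ]; then
    echo "KV cache transfer: NIXL (experimental)"
elif [ "${TRTLLM_USE_UCX_KVCACHE:-0}" = "1" ]; then
    echo "KV cache transfer: UCX"
else
    echo "KV cache transfer: library default (UCX)"
fi
```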
#### Using NIXL for KV Cache Transfer

To enable NIXL for KV cache transfer in disaggregated serving:

1. **Build the container with NIXL support:**
   The TensorRT-LLM wheel must be built from source with NIXL support. The `./container/build.sh` script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support.

   **Remove the cached TensorRT-LLM wheel (only if it was previously built without NIXL support):**
   ```bash
   rm -rf /tmp/trtllm_wheel
   ```

   **Build the container with NIXL support:**
   ```bash
   ./container/build.sh --framework tensorrtllm \
       --use-default-experimental-tensorrtllm-commit \
       --trtllm-use-nixl-kvcache-experimental
   ```

   **Note:** Both the `--use-default-experimental-tensorrtllm-commit` and `--trtllm-use-nixl-kvcache-experimental` flags are required to enable NIXL support.

2. **Run the containerized environment:**
   See the [run container](#run-container) section to learn how to start the container image built in the previous step.

3. **Start the disaggregated service:**
   See [disaggregated serving](#disaggregated-serving) to learn how to start the deployment.

4. **Send a request:**
   See the [client](#client) section to learn how to send a request to the deployment.
**Important:** Ensure that ETCD and NATS services are running before starting the service.
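If these services are not already running, one possible way to bring them up locally with Docker is sketched below; the image tags and ports are common defaults rather than values mandated by this PR, and if the repository provides a compose file for these services, prefer that:

```bash
# Hypothetical local bring-up of the prerequisite services.
docker run -d --name nats -p 4222:4222 nats:latest -js   # -js enables JetStream
docker run -d --name etcd -p 2379:2379 \
    -e ALLOW_NONE_AUTHENTICATION=yes bitnami/etcd:latest  # dev-only, no auth
```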
The container will automatically configure the appropriate environment variable (`TRTLLM_USE_NIXL_KVCACHE=1`) when built with the NIXL flag. The same container image can also fall back to UCX for KV cache transfer by overriding the environment variables:

```bash
unset TRTLLM_USE_NIXL_KVCACHE
export TRTLLM_USE_UCX_KVCACHE=1
```

**Review comment** (on lines +347 to +348):

> Is it worth adding all this extra build logic, flags, documentation, etc. if we can just boil down the opt-in steps to something like this?
>
> **NIXL (Experimental)**
>
> To use NIXL for KV cache transfer instead of UCX, set the relevant environment variables:
>
> ```bash
> export TRTLLM_USE_NIXL_KVCACHE=1
> export TRTLLM_USE_UCX_KVCACHE=0
> ```

**Author response:**

> We are currently using the TensorRT-LLM wheel from PyPI by default. This has reduced the container build time considerably. However, these public wheels are not built with NIXL support. Currently, there is no way around rebuilding the TRTLLM wheel from scratch with some extra flags if someone wants to use NIXL. We could have enabled NIXL by default, but it would have undermined our build-time optimization from using the PyPI TRTLLM wheels. Once NIXL in TRTLLM is mature enough and available in the TRTLLM wheels on PyPI, we can simplify the instructions to what you have mentioned above.