
Commit 5f9f5cf (parent: fae2464)

Update to use the ROCm 6.0 / torch 2.1.1 image as the default base docker image

File tree

4 files changed: +35 -19 lines changed

  Dockerfile.rocm
  docs/source/getting_started/amd-installation.rst
  vllm/utils.py
  vllm/worker/worker.py

Dockerfile.rocm

Lines changed: 9 additions & 6 deletions
@@ -1,15 +1,14 @@
 # default base image
-ARG BASE_IMAGE="rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1"
+ARG BASE_IMAGE="rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1"
 
 FROM $BASE_IMAGE
 
-ARG BASE_IMAGE="rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1"
+ARG BASE_IMAGE="rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1"
 
 RUN echo "Base image is $BASE_IMAGE"
 
 # BASE_IMAGE for ROCm_5.7: "rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1"
-# BASE_IMAGE for ROCm_6.0: "rocm/pytorch:rocm6.0_ubuntu22.04_py3.9_pytorch_2.0.1"
-# Testing image: "compute-artifactory.amd.com:5000/rocm-plus-docker/framework/release-public:rocm6.0_ubuntu20.04_py3.9_pytorch_rocm6.0_internal_testing"
+# BASE_IMAGE for ROCm_6.0: "rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1"
 
 # this does not always work for all rocm versions
 RUN LLVM_GFX_ARCH=$(/opt/rocm/llvm/bin/amdgpu-offload-arch) && \
@@ -26,7 +25,6 @@ RUN echo "FA_BRANCH is $FA_BRANCH"
 # Install some basic utilities
 RUN apt-get update && apt-get install python3 python3-pip -y
 
-
 # Install some basic utilities
 RUN apt-get update && apt-get install -y \
     curl \
@@ -72,7 +70,12 @@ RUN mkdir libs \
 COPY ./ /app/vllm
 
 RUN python3 -m pip install --upgrade pip
-RUN pip install xformers==0.0.23 --no-deps
+RUN python3 -m pip install xformers==0.0.23 --no-deps
+
+# Error related to odd state for numpy 1.20.3 where there is no METADATA etc, but an extra LICENSES_bundled.txt.
+# Manually removed it so that later steps of numpy upgrade can continue
+RUN if [ "$BASE_IMAGE" = "rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1" ]; then \
+    rm -rf /opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy-1.20.3.dist-info/; fi
 
 RUN cd /app \
     && cd vllm \
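
For anyone debugging the numpy issue addressed by the new RUN step, here is a small sketch (not part of the commit; the conda path is simply the one used in the RUN step above) that reports whether the dist-info directory is in the broken state described in the comment:

    # Sketch: detect the broken numpy-1.20.3 dist-info state described above.
    # The path mirrors the RUN step in Dockerfile.rocm; adjust for other base images.
    from pathlib import Path

    dist_info = Path("/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy-1.20.3.dist-info")
    if dist_info.is_dir() and not (dist_info / "METADATA").is_file():
        print(f"{dist_info} has no METADATA; remove it before upgrading numpy")
    else:
        print("numpy dist-info looks consistent (or has already been removed)")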

docs/source/getting_started/amd-installation.rst

Lines changed: 10 additions & 5 deletions
@@ -97,22 +97,22 @@ You can build and install vLLM from source:
 
 Build a docker image from `Dockerfile.rocm`, and launch a docker container.
 
-The `Dokerfile.rocm` is designed to support both ROCm 5.7 and ROCm 6.0. It provides flexibility to customize the build of docker image using the following arguments:
+The `Dokerfile.rocm` is designed to support both ROCm 5.7 and ROCm 6.0 and later versions. It provides flexibility to customize the build of docker image using the following arguments:
 
-* `BASE_IMAGE`: specifies the base image used when running `docker build`, specifically the PyTorch on ROCm base image. We have tested ROCm 5.7 and ROCm 6.0. The default is `rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1`
+* `BASE_IMAGE`: specifies the base image used when running `docker build`, specifically the PyTorch on ROCm base image. We have tested ROCm 5.7 and ROCm 6.0. The default is `rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1`
 * `FX_GFX_ARCHS`: specifies the GFX architecture that is used to build flash-attention, for example, `gfx90a;gfx942` for MI200 and MI300. The default is `gfx90a;gfx942`
 * `FA_BRANCH`: specifies the branch used to build the flash-attention in `ROCmSoftwarePlatform's flash-attention repo <https://github.com/ROCmSoftwarePlatform/flash-attention>`_. The default is `3d2b6f5`
 
 Their values can be passed in when running `docker build` with `--build-arg` options.
 
-For example, to build docker image for vllm on ROCm 6.0, you can run:
+For example, to build docker image for vllm on ROCm 5.7, you can run:
 
 .. code-block:: console
 
-    $ docker build --build-arg BASE_IMAGE="compute-artifactory.amd.com:5000/rocm-plus-docker/framework/release-public:rocm6.0_ubuntu20.04_py3.9_pytorch_rocm6.0_internal_testing" \
+    $ docker build --build-arg BASE_IMAGE="rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1" \
        -f Dockerfile.rocm -t vllm-rocm .
 
-To build vllm on ROCm 5.7, you can use the default:
+To build vllm on ROCm 6.0, you can use the default:
 
 .. code-block:: console
 
@@ -161,3 +161,8 @@ Alternatively, if you plan to install vLLM-ROCm on a local machine or start from
    $ cd vllm
    $ pip install -U -r requirements-rocm.txt
    $ python setup.py install # This may take 5-10 minutes.
+
+.. note::
+
+    - You may need to turn on the "--enforce-eager" flag if you experience process hang when running the `run_benchmark.py` script to test your installation.
+
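
The "--enforce-eager" flag mentioned in the new note corresponds to the `enforce_eager` option of the vLLM Python API. A minimal sketch of testing an installation that way (the model name is only an example):

    # Minimal sketch: disable CUDA/HIP graph capture, the API equivalent of
    # passing --enforce-eager on the command line. Model name is illustrative.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m", enforce_eager=True)
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)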

vllm/utils.py

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ def get_max_shared_memory_bytes(gpu: int = 0) -> int:
         cudaDevAttrMaxSharedMemoryPerBlockOptin, gpu)
     if max_shared_mem == 0 and is_hip():
         # got 0 sometimes when using 74
+        print("get_max_shared_memory_bytes got 0, trying to use value 97 for ROCm")
         cudaDevAttrMaxSharedMemoryPerBlockOptin = 97
         max_shared_mem = cuda_utils.get_device_attribute(
             cudaDevAttrMaxSharedMemoryPerBlockOptin, gpu)
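
The surrounding function queries device attribute 74 (cudaDevAttrMaxSharedMemoryPerBlockOptin) and, when that comes back as 0 on ROCm, retries with attribute 97. A standalone sketch of that fallback, with a hypothetical `query_attr` standing in for vllm's internal `cuda_utils.get_device_attribute` binding:

    # Illustration only: `query_attr` stands in for cuda_utils.get_device_attribute.
    def shared_memory_bytes_with_fallback(query_attr, gpu: int = 0, on_hip: bool = True) -> int:
        primary_attr = 74   # cudaDevAttrMaxSharedMemoryPerBlockOptin
        fallback_attr = 97  # attribute observed to return a usable value on ROCm
        max_shared_mem = query_attr(primary_attr, gpu)
        if max_shared_mem == 0 and on_hip:
            print("get_max_shared_memory_bytes got 0, trying to use value 97 for ROCm")
            max_shared_mem = query_attr(fallback_attr, gpu)
        return max_shared_mem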

vllm/worker/worker.py

Lines changed: 15 additions & 8 deletions
@@ -65,20 +65,27 @@ def init_model(self) -> None:
 
         # This env var set by Ray causes exceptions with graph building.
         os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
-
-        # This caused problem for rank non-0 (for example, 1 when -tp 2), when calling torch.cuda.set_device(self.device) in ROCm.
-        # HIP Error invalid device ordial
-        # where CUDA_VISIABLE_DEVICES=0,1, and set_device with cuda:1.
+
         try:
             self.device = torch.device(f"cuda:{self.local_rank}")
             torch.cuda.set_device(self.device)
         except RuntimeError as re:
+            # On certain versions, we experienced RuntimeError for rank non-0 when running with tensor-parallel option on ROCm.
+            # For example, for option, -tp 2, calling torch.cuda.set_device(self.device) for device 1 would throw the following error:
+            # HIP Error invalid device ordial
+            # By debugging, we found that CUDA_VISIABLE_DEVICES=0,1, but device_count is 1 and env HIP_VISIBLE_DEVICES is None.
+            # below is a work around when that happens so that we can continue
+            device_count = torch.cuda.device_count()
             print(
-                f"RuntimeError {re} in cuda.set_device {self.device}, visible device={os.environ.get('CUDA_VISIBLE_DEVICES')}. "
+                f"RuntimeError {re} in cuda.set_device {self.device}, device_count={device_count}. "
             )
-            self.device = torch.device("cuda:0")
-            print(f"Trying get around by set_device to {self.device}")
-            torch.cuda.set_device(self.device)
+            if device_count > 0:
+                self.device = torch.device("cuda:0")
+                print(f"Trying get around by set_device to {self.device}")
+                torch.cuda.set_device(self.device)
+            else:
+                # no work around is available
+                raise
 
         _check_if_gpu_supports_dtype(self.model_config.dtype)
 
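As a self-contained illustration of the fallback introduced above (not the vllm Worker code itself; only torch is assumed), the pattern reduces to:

    # Sketch of the ROCm set_device workaround: try the rank's device, fall back
    # to cuda:0 if any device is visible, otherwise re-raise. Requires only torch.
    import torch

    def set_device_with_fallback(local_rank: int) -> torch.device:
        try:
            device = torch.device(f"cuda:{local_rank}")
            torch.cuda.set_device(device)
            return device
        except RuntimeError as re:
            device_count = torch.cuda.device_count()
            print(f"RuntimeError {re} in cuda.set_device cuda:{local_rank}, "
                  f"device_count={device_count}.")
            if device_count > 0:
                device = torch.device("cuda:0")
                print(f"Trying to get around it by set_device to {device}")
                torch.cuda.set_device(device)
                return device
            raise  # no workaround available without a visible device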

0 commit comments