
Commit 4e81eb2

Update vllm to use latest upstream to support CPU (#179)

* update vllm to use upstream v0.4.0.post1
* nit
* adjust watch list
* Add llm_on_ray package installation and set CPU key-value cache size
* Remove device=infer_conf.device and add comment explaining why
* Update VLLM installation script to use main commit
* Update GCC version detection in install-vllm-cpu.sh script
* Update vllm-cpu installation method
* Fix Docker build command and update YAML configuration files
* Add VLLM_CPU_KVCACHE_SPACE_DEFAULT constant to control the size of the CPU key-value cache
* update
* nit
* Update default value of VLLM_CPU_KVCACHE_SPACE to 40GB
* Fix indentation in workflow_inference.yml
* debug
* debug
* debug
* nit
* nit
* debug
* Enable non-gated and gated models access

Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com>
1 parent 83803da commit 4e81eb2

File tree

7 files changed (+29, -59 lines)


.github/workflows/workflow_finetune.yml

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ jobs:
 
 - name: Build Docker Image
 run: |
-docker build ./ --build-arg CACHEBUST=1 --build-arg http_proxy=${{ inputs.http_proxy }} --build-arg https_proxy=${{ inputs.https_proxy }} -f dev/docker/Dockerfile.cpu_and_deepspeed -t finetune:latest
+docker build ./ --build-arg CACHEBUST=1 --build-arg http_proxy=${{ inputs.http_proxy }} --build-arg https_proxy=${{ inputs.https_proxy }} -f dev/docker/Dockerfile.cpu_and_deepspeed -t finetune:latest
 docker container prune -f
 docker image prune -f
 

.github/workflows/workflow_inference.yml

Lines changed: 3 additions & 27 deletions
@@ -96,7 +96,7 @@ jobs:
 DF_SUFFIX=".cpu_and_deepspeed"
 fi
 TARGET=${{steps.target.outputs.target}}
-docker build ./ --build-arg CACHEBUST=1 --build-arg http_proxy=${{ inputs.http_proxy }} --build-arg https_proxy=${{ inputs.https_proxy }} -f dev/docker/Dockerfile${DF_SUFFIX} -t ${TARGET}:latest
+docker build ./ --build-arg CACHEBUST=1 --build-arg http_proxy=${{ inputs.http_proxy }} --build-arg https_proxy=${{ inputs.https_proxy }} -f dev/docker/Dockerfile${DF_SUFFIX} -t ${TARGET}:latest
 docker container prune -f
 docker image prune -f
 
@@ -118,32 +118,8 @@ jobs:
 - name: Run Inference Test
 run: |
 TARGET=${{steps.target.outputs.target}}
-CMD=$(cat << EOF
-import yaml
-if ("${{ matrix.model }}" == "starcoder"):
-conf_path = "llm_on_ray/inference/models/starcoder.yaml"
-with open(conf_path, encoding="utf-8") as reader:
-result = yaml.load(reader, Loader=yaml.FullLoader)
-result['model_description']["config"]["use_auth_token"] = "${{ env.HF_ACCESS_TOKEN }}"
-with open(conf_path, 'w') as output:
-yaml.dump(result, output, sort_keys=False)
-if ("${{ matrix.model }}" == "llama-2-7b-chat-hf"):
-conf_path = "llm_on_ray/inference/models/llama-2-7b-chat-hf.yaml"
-with open(conf_path, encoding="utf-8") as reader:
-result = yaml.load(reader, Loader=yaml.FullLoader)
-result['model_description']["config"]["use_auth_token"] = "${{ env.HF_ACCESS_TOKEN }}"
-with open(conf_path, 'w') as output:
-yaml.dump(result, output, sort_keys=False)
-if ("${{ matrix.model }}" == "gemma-2b"):
-conf_path = "llm_on_ray/inference/models/gemma-2b.yaml"
-with open(conf_path, encoding="utf-8") as reader:
-result = yaml.load(reader, Loader=yaml.FullLoader)
-result['model_description']["config"]["use_auth_token"] = "${{ env.HF_ACCESS_TOKEN }}"
-with open(conf_path, 'w') as output:
-yaml.dump(result, output, sort_keys=False)
-EOF
-)
-docker exec "${TARGET}" python -c "$CMD"
+# Enable non-gated and gated models access
+docker exec "${TARGET}" bash -c "huggingface-cli login --token ${{ env.HF_ACCESS_TOKEN }}"
 if [[ ${{ matrix.model }} == "mpt-7b-ipex-llm" ]]; then
 docker exec "${TARGET}" bash -c "llm_on_ray-serve --config_file llm_on_ray/inference/models/ipex-llm/mpt-7b-ipex-llm.yaml --simple"
 elif [[ ${{ matrix.model }} == "llama-2-7b-chat-hf-vllm" ]]; then
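
Note: the single huggingface-cli login call above replaces the per-model YAML edits that previously injected use_auth_token; the token is registered once inside the container, and both gated and non-gated models then authenticate on download. A minimal Python sketch of the same step using the huggingface_hub library is shown below (illustration only, not part of the commit; it assumes HF_ACCESS_TOKEN is exported in the environment):

    import os
    from huggingface_hub import login

    # Equivalent of `huggingface-cli login --token <token>`: stores the token
    # so later downloads of gated models authenticate automatically.
    login(token=os.environ["HF_ACCESS_TOKEN"])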

.github/workflows/workflow_orders_on_merge.yml

Lines changed: 2 additions & 10 deletions
@@ -5,16 +5,8 @@ on:
 branches:
 - main
 paths:
-- '.github/**'
-- 'docker/**'
-- 'dev/docker/**'
-- 'llm_on_ray/common/**'
-- 'llm_on_ray/finetune/**'
-- 'llm_on_ray/inference/**'
-- 'llm_on_ray/rlhf/**'
-- 'tools/**'
-- 'pyproject.toml'
-- 'tests/**'
+- '**'
+- '!*.md'
 
 jobs:
 Lint:

.github/workflows/workflow_orders_on_pr.yml

Lines changed: 2 additions & 10 deletions
@@ -5,16 +5,8 @@ on:
 branches:
 - main
 paths:
-- '.github/**'
-- 'docker/**'
-- 'dev/docker/**'
-- 'llm_on_ray/common/**'
-- 'llm_on_ray/finetune/**'
-- 'llm_on_ray/inference/**'
-- 'llm_on_ray/rlhf/**'
-- 'tools/**'
-- 'pyproject.toml'
-- 'tests/**'
+- '**'
+- '!*.md'
 
 jobs:

dev/docker/Dockerfile.vllm

Lines changed: 6 additions & 6 deletions
@@ -28,14 +28,14 @@ COPY ./pyproject.toml .
 COPY ./MANIFEST.in .
 COPY ./dev/scripts/install-vllm-cpu.sh .
 
-# create llm_on_ray package directory to bypass the following 'pip install -e' command
-RUN mkdir ./llm_on_ray
-
-RUN --mount=type=cache,target=/root/.cache/pip pip install -e .[cpu] --extra-index-url https://download.pytorch.org/whl/cpu \
---extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
-
 # Install vllm-cpu
 # Activate base first for loading g++ envs ($CONDA_PREFIX/etc/conda/activate.d/*)
 RUN --mount=type=cache,target=/root/.cache/pip \
 source /opt/conda/bin/activate base && ./install-vllm-cpu.sh
 
+# Install llm_on_ray
+# Create llm_on_ray package directory to bypass the following 'pip install -e' command
+RUN mkdir ./llm_on_ray
+RUN --mount=type=cache,target=/root/.cache/pip pip install -e .[cpu] --extra-index-url https://download.pytorch.org/whl/cpu \
+--extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
+

dev/scripts/install-vllm-cpu.sh

Lines changed: 8 additions & 4 deletions
@@ -4,17 +4,21 @@
 [[ -n $(which g++) ]] || { echo "GNU C++ Compiler (g++) is not found!"; exit 1; }
 [[ -n $(which pip) ]] || { echo "pip command is not found!"; exit 1; }
 
-# g++ version should be >=12.3
+# g++ version should be >=12.3. On Ubuntu 22.4, you can run:
+# sudo apt-get update -y
+# sudo apt-get install -y gcc-12 g++-12
+# sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
 version_greater_equal()
 {
 printf '%s\n%s\n' "$2" "$1" | sort --check=quiet --version-sort
 }
-gcc_version=$(g++ -dumpversion)
+gcc_version=$(g++ --version | grep -o -E '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
 echo
 echo Current GNU C++ Compiler version: $gcc_version
 echo
 version_greater_equal "${gcc_version}" 12.3.0 || { echo "GNU C++ Compiler 12.3.0 or above is required!"; exit 1; }
 
-# Install from source
-MAX_JOBS=8 pip install -v git+https://github.com/bigPYJ1151/vllm@PR_Branch \
+# Refer to https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html to install from source
+# We use this one-liner to install latest vllm-cpu
+MAX_JOBS=8 VLLM_TARGET_DEVICE=cpu pip install -v git+https://github.com/vllm-project/vllm.git \
 --extra-index-url https://download.pytorch.org/whl/cpu
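
Note: the script's version check relies on sort --version-sort, so a version such as 12.10.0 correctly compares as newer than 12.3.0 (a plain string comparison would not). A hedged Python equivalent of the same check, using the packaging library purely for illustration (not part of the commit):

    from packaging.version import Version

    # Same semantics as: version_greater_equal "$gcc_version" 12.3.0
    def version_greater_equal(found: str, required: str) -> bool:
        return Version(found) >= Version(required)

    assert version_greater_equal("12.10.0", "12.3.0")       # version-aware: 12.10 > 12.3
    assert not version_greater_equal("11.4.0", "12.3.0")    # too old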

llm_on_ray/inference/vllm_predictor.py

Lines changed: 7 additions & 1 deletion
@@ -15,6 +15,7 @@
 #
 
 import asyncio
+import os
 from typing import AsyncGenerator, List, Union
 from vllm.engine.arg_utils import AsyncEngineArgs
 from vllm.engine.async_llm_engine import AsyncLLMEngine
@@ -25,20 +26,25 @@
 
 
 class VllmPredictor(Predictor):
+    VLLM_CPU_KVCACHE_SPACE_DEFAULT = 40
+
     def __init__(self, infer_conf: InferenceConfig, max_num_seqs):
         super().__init__(infer_conf)
 
         model_desc = infer_conf.model_description
         model_config = model_desc.config
         dtype = "bfloat16" if infer_conf.vllm.precision == PRECISION_BF16 else "float32"
 
+        # Set environment variable VLLM_CPU_KVCACHE_SPACE to control the size of the CPU key-value cache.
+        # The default value is 40GB.
+        os.environ["VLLM_CPU_KVCACHE_SPACE"] = str(self.VLLM_CPU_KVCACHE_SPACE_DEFAULT)
+
         args = AsyncEngineArgs(
             model=model_desc.model_id_or_path,
             trust_remote_code=model_config.trust_remote_code,
             device=infer_conf.device,
             dtype=dtype,
             disable_log_requests=True,
-            swap_space=40,
             max_num_seqs=max_num_seqs,
         )
 
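
Note: taken together, this change drops vLLM's swap_space argument and instead sizes the CPU key-value cache through the VLLM_CPU_KVCACHE_SPACE environment variable, which the vLLM CPU backend reads when the engine starts. A minimal standalone sketch of the resulting initialization flow follows (the model id, dtype, and max_num_seqs values are placeholders, not taken from the commit):

    import os

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    VLLM_CPU_KVCACHE_SPACE_DEFAULT = 40  # GB reserved for the CPU KV cache

    # Must be set before the engine is constructed, because the CPU backend
    # reads VLLM_CPU_KVCACHE_SPACE when it allocates the cache.
    os.environ["VLLM_CPU_KVCACHE_SPACE"] = str(VLLM_CPU_KVCACHE_SPACE_DEFAULT)

    args = AsyncEngineArgs(
        model="facebook/opt-125m",   # placeholder model id
        trust_remote_code=False,
        dtype="bfloat16",
        disable_log_requests=True,
        max_num_seqs=64,             # placeholder value
    )
    engine = AsyncLLMEngine.from_engine_args(args)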
