Update vllm to use latest upstream to support CPU #179

Merged: 23 commits, Apr 28, 2024

Commits (23)
8fce284 - update vllm to use upstream v0.4.0.post1 (xwu-intel, Apr 8, 2024)
7118577 - nit (xwu-intel, Apr 8, 2024)
8100b74 - adjust watch list (xwu-intel, Apr 8, 2024)
9a399ad - Add llm_on_ray package installation and set CPU key-value cache size (xwu-intel, Apr 8, 2024)
e50e8fb - Remove device=infer_conf.device and add comment explaining why (xwu-intel, Apr 8, 2024)
9ee1eb7 - Update VLLM installation script to use main commit (xwu-intel, Apr 9, 2024)
ad317d3 - Merge branch 'main' of https://github.com/intel/llm-on-ray into updat… (xwu-intel, Apr 9, 2024)
6a56e1a - Update GCC version detection in install-vllm-cpu.sh script (xwu-intel, Apr 9, 2024)
a983b2b - Update vllm-cpu installation method (xwu-intel, Apr 26, 2024)
1c653ca - Fix Docker build command and update YAML configuration files (xwu-intel, Apr 28, 2024)
a64bbf2 - Merge branch 'main' of https://github.com/intel/llm-on-ray into updat… (xwu-intel, Apr 28, 2024)
870bed7 - Add VLLM_CPU_KVCACHE_SPACE_DEFAULT constant to control the size of th… (xwu-intel, Apr 28, 2024)
df317f6 - update (xwu-intel, Apr 28, 2024)
f9c3945 - nit (xwu-intel, Apr 28, 2024)
c227f98 - Update default value of VLLM_CPU_KVCACHE_SPACE to 40GB (xwu-intel, Apr 28, 2024)
5d0adc2 - Fix indentation in workflow_inference.yml (xwu-intel, Apr 28, 2024)
e220f76 - debug (xwu-intel, Apr 28, 2024)
1a8cb7c - debug (xwu-intel, Apr 28, 2024)
9b342bf - debug (xwu-intel, Apr 28, 2024)
ce59387 - nit (xwu-intel, Apr 28, 2024)
cc5c75c - nit (xwu-intel, Apr 28, 2024)
dc13542 - debug (xwu-intel, Apr 28, 2024)
96dcdc0 - Enable non-gated and gated models access (xwu-intel, Apr 28, 2024)

2 changes: 1 addition & 1 deletion .github/workflows/workflow_finetune.yml
@@ -70,7 +70,7 @@ jobs:

- name: Build Docker Image
run: |
docker build ./ --build-arg CACHEBUST=1 --build-arg http_proxy=${{ inputs.http_proxy }} --build-arg https_proxy=${{ inputs.https_proxy }} -f dev/docker/Dockerfile.cpu_and_deepspeed -t finetune:latest
docker build ./ --build-arg CACHEBUST=1 --build-arg http_proxy=${{ inputs.http_proxy }} --build-arg https_proxy=${{ inputs.https_proxy }} -f dev/docker/Dockerfile.cpu_and_deepspeed -t finetune:latest
docker container prune -f
docker image prune -f

30 changes: 3 additions & 27 deletions .github/workflows/workflow_inference.yml
@@ -96,7 +96,7 @@ jobs:
DF_SUFFIX=".cpu_and_deepspeed"
fi
TARGET=${{steps.target.outputs.target}}
docker build ./ --build-arg CACHEBUST=1 --build-arg http_proxy=${{ inputs.http_proxy }} --build-arg https_proxy=${{ inputs.https_proxy }} -f dev/docker/Dockerfile${DF_SUFFIX} -t ${TARGET}:latest
docker build ./ --build-arg CACHEBUST=1 --build-arg http_proxy=${{ inputs.http_proxy }} --build-arg https_proxy=${{ inputs.https_proxy }} -f dev/docker/Dockerfile${DF_SUFFIX} -t ${TARGET}:latest
docker container prune -f
docker image prune -f

@@ -118,32 +118,8 @@ jobs:
- name: Run Inference Test
run: |
TARGET=${{steps.target.outputs.target}}
CMD=$(cat << EOF
import yaml
if ("${{ matrix.model }}" == "starcoder"):
conf_path = "llm_on_ray/inference/models/starcoder.yaml"
with open(conf_path, encoding="utf-8") as reader:
result = yaml.load(reader, Loader=yaml.FullLoader)
result['model_description']["config"]["use_auth_token"] = "${{ env.HF_ACCESS_TOKEN }}"
with open(conf_path, 'w') as output:
yaml.dump(result, output, sort_keys=False)
if ("${{ matrix.model }}" == "llama-2-7b-chat-hf"):
conf_path = "llm_on_ray/inference/models/llama-2-7b-chat-hf.yaml"
with open(conf_path, encoding="utf-8") as reader:
result = yaml.load(reader, Loader=yaml.FullLoader)
result['model_description']["config"]["use_auth_token"] = "${{ env.HF_ACCESS_TOKEN }}"
with open(conf_path, 'w') as output:
yaml.dump(result, output, sort_keys=False)
if ("${{ matrix.model }}" == "gemma-2b"):
conf_path = "llm_on_ray/inference/models/gemma-2b.yaml"
with open(conf_path, encoding="utf-8") as reader:
result = yaml.load(reader, Loader=yaml.FullLoader)
result['model_description']["config"]["use_auth_token"] = "${{ env.HF_ACCESS_TOKEN }}"
with open(conf_path, 'w') as output:
yaml.dump(result, output, sort_keys=False)
EOF
)
docker exec "${TARGET}" python -c "$CMD"
# Enable non-gated and gated models access
docker exec "${TARGET}" bash -c "huggingface-cli login --token ${{ env.HF_ACCESS_TOKEN }}"
if [[ ${{ matrix.model }} == "mpt-7b-ipex-llm" ]]; then
docker exec "${TARGET}" bash -c "llm_on_ray-serve --config_file llm_on_ray/inference/models/ipex-llm/mpt-7b-ipex-llm.yaml --simple"
elif [[ ${{ matrix.model }} == "llama-2-7b-chat-hf-vllm" ]]; then
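Context for the change above: a single huggingface-cli login inside the container replaces the per-model YAML patching that previously injected use_auth_token, so both gated and non-gated checkpoints can be pulled with one authenticated session. Below is a sketch of the equivalent programmatic call through the huggingface_hub package (the same package that ships the huggingface-cli entry point); reading the token from an HF_ACCESS_TOKEN environment variable is illustrative only.

import os

from huggingface_hub import login

# Programmatic equivalent of `huggingface-cli login --token <token>`: stores the
# token locally so later downloads of gated model checkpoints are authenticated.
login(token=os.environ["HF_ACCESS_TOKEN"])  # illustrative variable name
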
12 changes: 2 additions & 10 deletions .github/workflows/workflow_orders_on_merge.yml
@@ -5,16 +5,8 @@ on:
branches:
- main
paths:
- '.github/**'
- 'docker/**'
- 'dev/docker/**'
- 'llm_on_ray/common/**'
- 'llm_on_ray/finetune/**'
- 'llm_on_ray/inference/**'
- 'llm_on_ray/rlhf/**'
- 'tools/**'
- 'pyproject.toml'
- 'tests/**'
- '**'
- '!*.md'

jobs:
Lint:
12 changes: 2 additions & 10 deletions .github/workflows/workflow_orders_on_pr.yml
@@ -5,16 +5,8 @@ on:
branches:
- main
paths:
- '.github/**'
- 'docker/**'
- 'dev/docker/**'
- 'llm_on_ray/common/**'
- 'llm_on_ray/finetune/**'
- 'llm_on_ray/inference/**'
- 'llm_on_ray/rlhf/**'
- 'tools/**'
- 'pyproject.toml'
- 'tests/**'
- '**'
- '!*.md'

jobs:

12 changes: 6 additions & 6 deletions dev/docker/Dockerfile.vllm
@@ -28,14 +28,14 @@ COPY ./pyproject.toml .
COPY ./MANIFEST.in .
COPY ./dev/scripts/install-vllm-cpu.sh .

# create llm_on_ray package directory to bypass the following 'pip install -e' command
RUN mkdir ./llm_on_ray

RUN --mount=type=cache,target=/root/.cache/pip pip install -e .[cpu] --extra-index-url https://download.pytorch.org/whl/cpu \
--extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/

# Install vllm-cpu
# Activate base first for loading g++ envs ($CONDA_PREFIX/etc/conda/activate.d/*)
RUN --mount=type=cache,target=/root/.cache/pip \
source /opt/conda/bin/activate base && ./install-vllm-cpu.sh

# Install llm_on_ray
# Create llm_on_ray package directory to bypass the following 'pip install -e' command
RUN mkdir ./llm_on_ray
RUN --mount=type=cache,target=/root/.cache/pip pip install -e .[cpu] --extra-index-url https://download.pytorch.org/whl/cpu \
--extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/

12 changes: 8 additions & 4 deletions dev/scripts/install-vllm-cpu.sh
@@ -4,17 +4,21 @@
[[ -n $(which g++) ]] || { echo "GNU C++ Compiler (g++) is not found!"; exit 1; }
[[ -n $(which pip) ]] || { echo "pip command is not found!"; exit 1; }

# g++ version should be >=12.3
# g++ version should be >=12.3. On Ubuntu 22.04, you can run:
# sudo apt-get update -y
# sudo apt-get install -y gcc-12 g++-12
# sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
version_greater_equal()
{
printf '%s\n%s\n' "$2" "$1" | sort --check=quiet --version-sort
}
gcc_version=$(g++ -dumpversion)
gcc_version=$(g++ --version | grep -o -E '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
echo
echo Current GNU C++ Compiler version: $gcc_version
echo
version_greater_equal "${gcc_version}" 12.3.0 || { echo "GNU C++ Compiler 12.3.0 or above is required!"; exit 1; }

# Install from source
MAX_JOBS=8 pip install -v git+https://github.com/bigPYJ1151/vllm@PR_Branch \
# Refer to https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html to install from source
# We use this one-liner to install latest vllm-cpu
MAX_JOBS=8 VLLM_TARGET_DEVICE=cpu pip install -v git+https://github.com/vllm-project/vllm.git \
--extra-index-url https://download.pytorch.org/whl/cpu
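A note on the gcc_version change above: on many distribution builds, g++ -dumpversion reports only the major version (for example just "12"), which does not compare meaningfully against 12.3.0, and that is likely why the script now extracts the full x.y.z triplet from g++ --version. The following Python sketch mirrors the same detection and comparison logic (a hypothetical helper, not part of this PR):

import re
import subprocess


def gxx_version() -> str:
    # Mirror the grep in install-vllm-cpu.sh: take the first x.y.z triplet
    # from the full `g++ --version` banner.
    banner = subprocess.run(
        ["g++", "--version"], capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"\d+\.\d+\.\d+", banner)
    if match is None:
        raise RuntimeError("could not determine g++ version")
    return match.group(0)


def version_greater_equal(found: str, required: str) -> bool:
    # Numeric, component-wise comparison, like `sort --version-sort`.
    return tuple(map(int, found.split("."))) >= tuple(map(int, required.split(".")))


if __name__ == "__main__":
    version = gxx_version()
    if not version_greater_equal(version, "12.3.0"):
        raise SystemExit("GNU C++ Compiler 12.3.0 or above is required!")
    print(f"g++ {version} OK")
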
8 changes: 7 additions & 1 deletion llm_on_ray/inference/vllm_predictor.py
@@ -15,6 +15,7 @@
#

import asyncio
import os
from typing import AsyncGenerator, List, Union
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
@@ -25,20 +25,26 @@


class VllmPredictor(Predictor):
VLLM_CPU_KVCACHE_SPACE_DEFAULT = 40

def __init__(self, infer_conf: InferenceConfig, max_num_seqs):
super().__init__(infer_conf)

model_desc = infer_conf.model_description
model_config = model_desc.config
dtype = "bfloat16" if infer_conf.vllm.precision == PRECISION_BF16 else "float32"

# Set environment variable VLLM_CPU_KVCACHE_SPACE to control the size of the CPU key-value cache.
# The default value is 40GB.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = str(self.VLLM_CPU_KVCACHE_SPACE_DEFAULT)

args = AsyncEngineArgs(
model=model_desc.model_id_or_path,
trust_remote_code=model_config.trust_remote_code,
device=infer_conf.device,
dtype=dtype,
disable_log_requests=True,
swap_space=40,
max_num_seqs=max_num_seqs,
)
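
Background on the constant added above: vLLM's CPU backend sizes its key-value cache from the VLLM_CPU_KVCACHE_SPACE environment variable (value in GB), and the variable has to be set before the engine is constructed, which is why the predictor exports it before building AsyncEngineArgs. A minimal standalone sketch of the same pattern (model name, prompt, and cache size are illustrative only):

import os

# Export before the engine is built; the CPU backend reads VLLM_CPU_KVCACHE_SPACE
# (in GB) when allocating the key-value cache.
os.environ.setdefault("VLLM_CPU_KVCACHE_SPACE", "40")

from vllm import LLM, SamplingParams  # assumes a CPU build of vllm is installed

llm = LLM(model="facebook/opt-125m", dtype="bfloat16")
outputs = llm.generate(["Ray and vLLM on CPU:"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)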
