
[Hardware][Intel GPU] Add Initial Intel GPU (XPU) inference backend #3814

Merged: 68 commits, merged on Jun 17, 2024

Changes from all commits (68 commits)
792736d
add build, dependency
jikunshang Apr 11, 2024
f762493
add python layer support for xpu
jikunshang Apr 11, 2024
068d34b
add ipex ops
jikunshang Apr 11, 2024
19bc177
add test
jikunshang Apr 11, 2024
df77d6f
revert prepare_prompt
jikunshang Apr 12, 2024
a5f2c85
revert prefill
jikunshang Apr 12, 2024
1c3a527
fix
jikunshang Apr 12, 2024
feb6d66
fix alibi device support
jikunshang Apr 15, 2024
607c46e
remove
jikunshang Apr 15, 2024
09d0382
add tensorizer config, fix format
jikunshang Apr 15, 2024
ef280e0
update wheel, fix typo
jikunshang Apr 16, 2024
d5f3e1f
fix
jikunshang Apr 16, 2024
6b5f58c
fix xpu_executor
jikunshang Apr 16, 2024
27e2dcf
more fix
jikunshang Apr 16, 2024
7a6e6cd
typo
jikunshang Apr 16, 2024
a63dbf8
add co-author
jikunshang Apr 16, 2024
0acfe75
revert test, fix format
jikunshang Apr 16, 2024
58eafd0
fix ray_xpu_executor
jikunshang Apr 16, 2024
bc45bce
use varlen_fwd
jikunshang Apr 17, 2024
56b0016
use varlen_attention
jikunshang Apr 17, 2024
ded32a2
fix
jikunshang Apr 18, 2024
25b368b
remove padding
jikunshang Apr 18, 2024
3519897
avoid using page attention v2 for ipex
jikunshang Apr 19, 2024
36fae83
refactor
jikunshang Apr 19, 2024
8514432
minor
jikunshang Apr 19, 2024
e1a42da
add xpu test
jikunshang Apr 19, 2024
917b74a
use oneapi base docker image
jikunshang Apr 19, 2024
80961f7
format
jikunshang Apr 19, 2024
9037564
rebase, remove some config
jikunshang Apr 19, 2024
4e3d1ed
add LoadConfig
jikunshang Apr 19, 2024
e42f23a
fix execute_model
jikunshang Apr 19, 2024
8d9ef99
use v2
jikunshang Apr 19, 2024
d4dd31e
revert torch sdpa cpu path
jikunshang Apr 22, 2024
b4ca330
fix sdpa split cache on cpu path
jikunshang Apr 23, 2024
eaec862
add vision model support
jikunshang Apr 23, 2024
7d76334
fix ray xpu executor
jikunshang Apr 23, 2024
ce55b60
fix block table device
jikunshang Apr 23, 2024
39c07d9
add intel xpu test
jikunshang Apr 25, 2024
88d1b6e
address comments
jikunshang Apr 29, 2024
e00fbce
fix import, add doc
jikunshang Apr 29, 2024
50719c4
fix doc
jikunshang Apr 30, 2024
342ea72
fix
jikunshang May 1, 2024
b3231b7
format
jikunshang May 1, 2024
765fc2e
fix rebase issues
jikunshang May 6, 2024
ba7c162
fix ray_xpu_executor
jikunshang May 6, 2024
4d0ab33
fix rebase issue, copy/swap_blocks
jikunshang May 10, 2024
dc4d41a
fix format
jikunshang May 10, 2024
f505011
add xpu in benchmark_latency.py
abhilash1910 May 14, 2024
d23aec6
fix
jikunshang May 16, 2024
2d86f22
fix format
jikunshang May 16, 2024
fc0a8b8
address comments
jikunshang May 16, 2024
07c139b
fix tp issues
jikunshang May 16, 2024
634c951
fix worker
jikunshang May 16, 2024
f0e6407
fix ray xpu executor
jikunshang May 23, 2024
6871d55
fix due to code rebase
jikunshang Jun 3, 2024
bef2c78
setuptools version
jikunshang Jun 3, 2024
f037737
update docker file, due to public key expired.
jikunshang Jun 4, 2024
84f6b3a
add RayXPUExecutorAsync for serving
jikunshang Jun 6, 2024
a1f2970
update _custom_ops.py
jikunshang Jun 7, 2024
ca88270
add ipex_attn backend
jikunshang Jun 13, 2024
f1ebe9f
revert torch sdpa backend
jikunshang Jun 13, 2024
9046315
update document
jikunshang Jun 13, 2024
ebbc13e
format
jikunshang Jun 13, 2024
5d823d9
more fix
jikunshang Jun 13, 2024
bdb6ca5
revert torch sdpa, fix doc
jikunshang Jun 14, 2024
bcdf65a
address comments
jikunshang Jun 17, 2024
10ec2d2
remove fuse in ipex_attn backend
jikunshang Jun 17, 2024
b4fef36
Merge branch 'main' into xpu_0403
WoosukKwon Jun 17, 2024
14 changes: 14 additions & 0 deletions .buildkite/run-xpu-test.sh
@@ -0,0 +1,14 @@
# This script builds the XPU docker image and runs offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t xpu-test -f Dockerfile.xpu .

# Setup cleanup
remove_docker_container() { docker rm -f xpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path xpu-test python3 examples/offline_inference.py
5 changes: 5 additions & 0 deletions .buildkite/test-template.j2
@@ -45,6 +45,11 @@ steps:
queue: intel
command: bash .buildkite/run-cpu-test.sh

- label: "XPU Test"
agents:
queue: intel
command: bash .buildkite/run-xpu-test.sh

{% for step in steps %}
- label: "{{ step.label }}"
agents:
22 changes: 22 additions & 0 deletions Dockerfile.xpu
@@ -0,0 +1,22 @@
FROM intel/oneapi-basekit:2024.1.0-devel-ubuntu22.04

RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
chmod 644 /usr/share/keyrings/intel-oneapi-archive-keyring.gpg && \
rm /etc/apt/sources.list.d/intel-graphics.list && \
wget -O- https://repositories.intel.com/graphics/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null && \
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
chmod 644 /usr/share/keyrings/intel-graphics.gpg

RUN apt-get update -y \
&& apt-get install -y curl libicu70 lsb-release git wget vim numactl python3 python3-pip

COPY ./ /workspace/vllm

WORKDIR /workspace/vllm

RUN pip install -v -r requirements-xpu.txt

RUN VLLM_TARGET_DEVICE=xpu python3 setup.py install

CMD ["/bin/bash"]
2 changes: 1 addition & 1 deletion benchmarks/benchmark_latency.py
@@ -191,7 +191,7 @@ def run_to_completion(profile_dir: Optional[str] = None):
"--device",
type=str,
default="cuda",
choices=["cuda", "cpu", "tpu"],
choices=["cuda", "cpu", "tpu", "xpu"],
help='device type for vLLM execution, supporting CUDA and CPU.')
parser.add_argument('--block-size',
type=int,
2 changes: 1 addition & 1 deletion benchmarks/benchmark_throughput.py
Collaborator:

I got the following warning messages while running the benchmark:

2024:04:26-09:56:50:( 1947) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices

Is this expected?

Contributor (author):

This is a CCL/tensor-parallel issue, still WIP; it will not affect the single-card case.

Collaborator:

@jikunshang Do you have an update on this?

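For anyone hitting the same CCL_WARN message on a multi-card run, the log itself names the switch: CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0 disables oneCCL's topology recognition. A minimal, hedged sketch of applying it from Python before vLLM starts its tensor-parallel workers (the model name and parallel size are placeholders, not taken from this PR):

import os

# oneCCL reads this variable at initialization, so set it before the engine
# and its workers start; exporting it in the shell works the same way and may
# be required when workers are launched as separate processes.
os.environ["CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)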
@@ -349,7 +349,7 @@ def main(args: argparse.Namespace):
"--device",
type=str,
default="cuda",
choices=["cuda", "cpu", "tpu"],
choices=["cuda", "cpu", "tpu", "xpu"],
help='device type for vLLM execution, supporting CUDA and CPU.')
parser.add_argument(
"--enable-prefix-caching",
61 changes: 61 additions & 0 deletions docs/source/getting_started/xpu-installation.rst
@@ -0,0 +1,61 @@
.. _installation_xpu:

Installation with XPU
========================

vLLM initially supports basic model inference and serving on the Intel GPU platform.

Table of contents:

#. :ref:`Requirements <xpu_backend_requirements>`
#. :ref:`Quick start using Dockerfile <xpu_backend_quick_start_dockerfile>`
#. :ref:`Build from source <build_xpu_backend_from_source>`

.. _xpu_backend_requirements:

Requirements
------------

* OS: Linux
* Supported Hardware: Intel Data Center GPU (Intel ARC GPU WIP)
* OneAPI requirements: oneAPI 2024.1

.. _xpu_backend_quick_start_dockerfile:

Quick start using Dockerfile
----------------------------

.. code-block:: console

$ docker build -f Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
$ docker run -it \
--rm \
--network=host \
--device /dev/dri \
-v /dev/dri/by-path:/dev/dri/by-path \
vllm-xpu-env

.. _build_xpu_backend_from_source:

Build from source
-----------------

- First, install the required driver and Intel oneAPI 2024.1.

- Second, install the Python packages required to build the vLLM XPU backend:

.. code-block:: console

$ pip install --upgrade pip
$ pip install -v -r requirements-xpu.txt

- Finally, build and install the vLLM XPU backend:

.. code-block:: console

$ VLLM_TARGET_DEVICE=xpu python setup.py install

.. note::
- FP16 is the default data type in the current XPU backend. The BF16 data
type will be supported in the future.

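Once the backend is built, the quickest sanity check is a small offline generation run; below is a hedged sketch of that flow (the model name and sampling settings are illustrative, and the XPU device is picked up automatically when vLLM was built with VLLM_TARGET_DEVICE=xpu):

from vllm import LLM, SamplingParams

# "facebook/opt-125m" is only an example checkpoint; substitute any model
# supported by the XPU backend. FP16 is the backend's default data type.
llm = LLM(model="facebook/opt-125m", dtype="float16")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

outputs = llm.generate(["The capital of France is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)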
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -66,6 +66,7 @@ Documentation
getting_started/cpu-installation
getting_started/neuron-installation
getting_started/tpu-installation
getting_started/xpu-installation
getting_started/quickstart
getting_started/debugging
getting_started/examples/examples_index
11 changes: 11 additions & 0 deletions requirements-xpu.txt
@@ -0,0 +1,11 @@
# Common dependencies
-r requirements-common.txt

setuptools < 70.0.0 # IPEX's torch wheel requires an older setuptools; to be removed.

torch @ https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/xpu/torch-2.1.0.post1%2Bcxx11.abi-cp310-cp310-linux_x86_64.whl
intel_extension_for_pytorch @ https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_dev/xpu/intel_extension_for_pytorch-2.1.30a0-cp310-cp310-linux_x86_64.whl
oneccl_bind_pt @ https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/oneccl_bind_pt-2.1.200%2Bxpu-cp310-cp310-linux_x86_64.whl

triton @ https://github.com/intel/intel-xpu-backend-for-triton/releases/download/v2.1.0/triton-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

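Before building vLLM against these wheels, it can help to confirm that the IPEX/XPU stack imports and sees a device. This short check is a sketch based on the standard intel_extension_for_pytorch integration, not part of the PR:

import torch
import intel_extension_for_pytorch as ipex  # registers the torch.xpu backend

# If either import fails or no device is reported, the wheels above did not
# install correctly for this Python version or GPU driver.
print("IPEX version:", ipex.__version__)
print("XPU available:", torch.xpu.is_available())
print("XPU device count:", torch.xpu.device_count())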
8 changes: 8 additions & 0 deletions setup.py
@@ -233,6 +233,10 @@ def _is_cpu() -> bool:
return VLLM_TARGET_DEVICE == "cpu"


def _is_xpu() -> bool:
return VLLM_TARGET_DEVICE == "xpu"


def _build_custom_ops() -> bool:
return _is_cuda() or _is_hip() or _is_cpu()

@@ -337,6 +341,8 @@ def get_vllm_version() -> str:
version += "+tpu"
elif _is_cpu():
version += "+cpu"
elif _is_xpu():
version += "+xpu"
else:
raise RuntimeError("Unknown runtime environment")

@@ -386,6 +392,8 @@ def _read_requirements(filename: str) -> List[str]:
requirements = _read_requirements("requirements-tpu.txt")
elif _is_cpu():
requirements = _read_requirements("requirements-cpu.txt")
elif _is_xpu():
requirements = _read_requirements("requirements-xpu.txt")
else:
raise ValueError(
"Unsupported platform, please use CUDA, ROCm, Neuron, or CPU.")
3 changes: 2 additions & 1 deletion vllm/_custom_ops.py
@@ -373,7 +373,8 @@ def reshape_and_cache_flash(
kv_cache_dtype)


def copy_blocks(key_caches: torch.Tensor, value_caches: torch.Tensor,
def copy_blocks(key_caches: List[torch.Tensor],
value_caches: List[torch.Tensor],
block_mapping: torch.Tensor) -> None:
torch.ops._C_cache_ops.copy_blocks(key_caches, value_caches, block_mapping)

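The change above only widens the Python-side type hints from a single tensor to per-layer lists. For readers who want the semantics without the compiled extension, here is a pure-PyTorch reference sketch, assuming block_mapping is a [num_pairs, 2] tensor of (src, dst) physical block indices (an illustrative assumption, not something stated in this diff):

from typing import List
import torch

def copy_blocks_reference(key_caches: List[torch.Tensor],
                          value_caches: List[torch.Tensor],
                          block_mapping: torch.Tensor) -> None:
    # Each cache tensor is indexed by physical block id along its first
    # dimension; copy every (src, dst) pair in every layer's K and V cache.
    for src, dst in block_mapping.tolist():
        for key_cache, value_cache in zip(key_caches, value_caches):
            key_cache[dst].copy_(key_cache[src])
            value_cache[dst].copy_(value_cache[src])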