This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 04 26 #211

Merged: 107 commits, May 2, 2024
Commits (107)
eb2428e
[Test] Test multiple attn backend for chunked prefill. (#4023)
rkooo567 Apr 12, 2024
71760ce
[Bugfix] fix type hint for py 3.8 (#4036)
youkaichao Apr 12, 2024
405a695
[Misc] Fix typo in scheduler.py (#4022)
zhuohan123 Apr 12, 2024
801ad22
[mypy] Add mypy type annotation part 1 (#4006)
rkooo567 Apr 12, 2024
58911ec
[Core] fix custom allreduce default value (#4040)
youkaichao Apr 12, 2024
094013d
Fix triton compilation issue (#3984)
Bellk17 Apr 12, 2024
0b5c9ea
[Bugfix] Fix LoRA bug (#4032)
jeejeelee Apr 12, 2024
b35bba7
[CI/Test] expand ruff and yapf for all supported python version (#4037)
youkaichao Apr 13, 2024
0356684
[Bugfix] More type hint fixes for py 3.8 (#4039)
dylanwhawk Apr 13, 2024
0f5a490
[Core][Distributed] improve logging for init dist (#4042)
youkaichao Apr 13, 2024
a738567
[Bugfix] fix_log_time_in_metrics (#4050)
zspo Apr 13, 2024
5444860
[Bugfix] fix_small_bug_in_neuron_executor (#4051)
zspo Apr 13, 2024
7dd0af0
[Kernel] Add punica dimension for Baichuan-13B (#4053)
jeejeelee Apr 13, 2024
fab8ca1
[Frontend] [Core] feat: Add model loading using `tensorizer` (#3476)
sangstar Apr 14, 2024
f39e0b5
[Core] avoid too many cuda context by caching p2p test (#4021)
youkaichao Apr 14, 2024
de26ef7
[BugFix] Fix tensorizer extra in setup.py (#4072)
njhill Apr 14, 2024
d3f28b1
[Docs] document that mixtral 8x22b is supported (#4073)
simon-mo Apr 14, 2024
0012b9b
[Misc] Upgrade triton to 2.2.0 (#4061)
esmeetu Apr 15, 2024
6bd8ad1
[Bugfix] Fix filelock version requirement (#4075)
zhuohan123 Apr 15, 2024
5c33590
[Misc][Minor] Fix CPU block num log in CPUExecutor. (#4088)
bigPYJ1151 Apr 15, 2024
3d28207
[Core] Simplifications to executor classes (#4071)
njhill Apr 15, 2024
0008bf9
[Doc] Add better clarity for tensorizer usage (#4090)
sangstar Apr 15, 2024
6800f95
[Bugfix] Fix ray workers profiling with nsight (#4095)
rickyyx Apr 15, 2024
43af0d0
[Typing] Fix Sequence type GenericAlias only available after Python 3…
rkooo567 Apr 15, 2024
f045612
[Core] Fix engine-use-ray broken (#4105)
rkooo567 Apr 16, 2024
bc92515
LM Format Enforcer Guided Decoding Support (#3868)
noamgat Apr 16, 2024
2986e80
[Core] Refactor model loading code (#4097)
Yard1 Apr 16, 2024
945a6b7
[Speculative decoding 6/9] Integrate speculative decoding with LLMEng…
cadedaniel Apr 16, 2024
a84676e
[Misc] [CI] Fix CI failure caught after merge (#4126)
cadedaniel Apr 17, 2024
f56e1ae
[CI] Move CPU/AMD tests to after wait (#4123)
cadedaniel Apr 17, 2024
0d9aabe
[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication (#4024)
youkaichao Apr 17, 2024
ffd9ca8
[Bugfix] fix output parsing error for trtllm backend (#4137)
elinx Apr 17, 2024
302870b
[Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134)
ucciicci Apr 17, 2024
e69ff11
[Typing] Mypy typing part 2 (#4043)
rkooo567 Apr 18, 2024
085d445
[Core] nccl integrity check and test (#4155)
youkaichao Apr 18, 2024
05086c1
Allow model to be served under multiple names (#2894)
hmellor Apr 18, 2024
d005abb
[Bugfix] Get available quantization methods from quantization registr…
mgoin Apr 18, 2024
e70ec2f
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill …
mmoskal Apr 18, 2024
6b688e3
[Docs] document that Meta Llama 3 is supported (#4175)
simon-mo Apr 18, 2024
fcd0b11
[Bugfix] Support logprobs when using guided_json and other constraine…
jamestwhedbee Apr 18, 2024
7b9df95
[Misc] Bump transformers to latest version (#4176)
njhill Apr 18, 2024
ce72466
[CI/CD] add neuron docker and ci test scripts (#3571)
liangfu Apr 18, 2024
6a44497
[Bugfix] Fix CustomAllreduce nvlink topology detection (#3974)
agt Apr 18, 2024
6928163
[Core] add an option to log every function call to for debugging hang…
youkaichao Apr 18, 2024
e726a89
Support eos_token_id from generation_config.json (#4182)
simon-mo Apr 19, 2024
643d8d1
[Bugfix] Fix LoRA loading check (#4138)
jeejeelee Apr 19, 2024
39887d5
Bump version of 0.4.1 (#4177)
simon-mo Apr 19, 2024
d857aa0
[Misc] fix docstrings (#4191)
UranusSeven Apr 19, 2024
93b20db
[Bugfix][Core] Restore logging of stats in the async engine (#4150)
ronensc Apr 19, 2024
9d0b980
[Misc] add nccl in collect env (#4211)
youkaichao Apr 19, 2024
a1fe28b
Pass `tokenizer_revision` when getting tokenizer in openai serving (#…
chiragjn Apr 20, 2024
58b01e5
[Bugfix] Add fix for JSON whitespace (#4189)
ayusher Apr 20, 2024
dc1d3f7
Fix missing docs and out of sync `EngineArgs` (#4219)
hmellor Apr 20, 2024
3809434
[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118)
comaniac Apr 20, 2024
41f9c9b
[Frontend] multiple sampling params support (#3570)
nunjunj Apr 20, 2024
59e717b
Updating lm-format-enforcer version and adding links to decoding libr…
noamgat Apr 20, 2024
e916374
Don't show default value for flags in `EngineArgs` (#4223)
hmellor Apr 21, 2024
a8a2ad6
[Doc]: Update the doc of adding new models (#4236)
YeFD Apr 21, 2024
f17bb41
Make initialization of tokenizer and detokenizer optional (#3748)
GeauxEric Apr 21, 2024
0a74885
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic an…
hongxiayang Apr 22, 2024
3ca56cf
[Core][Distributed] fix _is_full_nvlink detection (#4233)
youkaichao Apr 22, 2024
1f7027e
[Misc] Add vision language model support to CPU backend (#3968)
Isotr0py Apr 22, 2024
81e4d26
[Bugfix] Fix type annotations in CPU model runner (#4256)
WoosukKwon Apr 22, 2024
7e236d1
[Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993)
sighingnow Apr 22, 2024
0065ecb
[Bugfix] Ensure download_weights_from_hf(..) inside loader is using t…
alexm-neuralmagic Apr 22, 2024
91e2e19
Add example scripts to documentation (#4225)
hmellor Apr 22, 2024
542dc70
[Core] Scheduler perf fix (#4270)
rkooo567 Apr 22, 2024
949c804
[Doc] Update the SkyPilot doc with serving and Llama-3 (#4276)
Michaelvll Apr 22, 2024
2c5f365
[Core][Distributed] use absolute path for library file (#4271)
youkaichao Apr 23, 2024
9b6f4f8
Fix `autodoc` directives (#4272)
hmellor Apr 23, 2024
fccd494
[Mypy] Part 3 fix typing for nested directories for most of directory…
rkooo567 Apr 23, 2024
9e8c339
[Core] Some simplification of WorkerWrapper changes (#4183)
njhill Apr 23, 2024
1c429b4
[Core] Scheduling optimization 2 (#4280)
rkooo567 Apr 23, 2024
dd092dd
[Speculative decoding 7/9] Speculative decoding end-to-end correctnes…
cadedaniel Apr 23, 2024
650eca0
[Bugfix] Fixing max token error message for openai compatible server …
jgordley Apr 23, 2024
328da32
[Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286)
DefTruth Apr 23, 2024
c2abce3
[Core][Logging] Add last frame information for better debugging (#4278)
youkaichao Apr 23, 2024
2a80bcf
[CI] Add ccache for wheel builds job (#4281)
simon-mo Apr 23, 2024
9e73227
AQLM CUDA support (#3287)
jaemzfleming Apr 23, 2024
f031047
[Bugfix][Frontend] Raise exception when file-like chat template fails…
DarkLight1337 Apr 23, 2024
a8c5a2d
[Kernel] FP8 support for MoE kernel / Mixtral (#4244)
pcmoritz Apr 24, 2024
e5b0dc8
[BUG] fixed fp8 conflict with aqlm (#4307)
robertgshaw2-neuralmagic Apr 24, 2024
16883fd
[Core][Distributed] use cpu/gloo to initialize pynccl (#4248)
youkaichao Apr 24, 2024
ffc3593
[CI][Build] change pynvml to nvidia-ml-py (#4302)
youkaichao Apr 24, 2024
cc2c2f2
[Misc] Reduce supported Punica dtypes (#4304)
WoosukKwon Apr 24, 2024
afd3970
[Core][Distributed] use existing torch.cuda.device (#4318)
youkaichao Apr 24, 2024
27ced33
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279)
ywang96 Apr 24, 2024
a110633
[Bugfix] Fix marlin kernel crash on H100 (#4218)
alexm-neuralmagic Apr 24, 2024
50f4e48
[Doc] Add note for docker user (#4340)
youkaichao Apr 24, 2024
382fb33
[Misc] Use public API in benchmark_throughput (#4300)
zifeitong Apr 24, 2024
e207f23
[Model] Adds Phi-3 support (#4298)
caiom Apr 25, 2024
b290035
[Core] Move ray_utils.py from `engine` to `executor` package (#4347)
njhill Apr 25, 2024
bd92e76
[Bugfix][Model] Refactor OLMo model to support new HF format in trans…
Isotr0py Apr 25, 2024
74e20c2
[CI/Build] Adding functionality to reset the node's GPUs before proce…
Alexei-V-Ivanov-AMD Apr 25, 2024
0f38d71
[Doc] README Phi-3 name fix. (#4372)
caiom Apr 25, 2024
fff6cd2
[Core]refactor aqlm quant ops (#4351)
jikunshang Apr 25, 2024
9bb7eff
[Mypy] Typing lora folder (#4337)
rkooo567 Apr 25, 2024
1917d86
[Misc] Fix flash attention backend log (#4368)
esmeetu Apr 25, 2024
b6d61b2
./format, fixed tests failing in automation due to ray.init()
robertgshaw2-neuralmagic Apr 26, 2024
6dcf181
fixed typo in run tests script
robertgshaw2-neuralmagic Apr 27, 2024
a8b853a
fixed sparsity issues with model loader refactor
robertgshaw2-neuralmagic Apr 30, 2024
b7fb44b
format
robertgshaw2-neuralmagic Apr 30, 2024
8177a4b
linter
robertgshaw2-neuralmagic Apr 30, 2024
96219f1
ruff ruff
robertgshaw2-neuralmagic Apr 30, 2024
16f1aa2
updated tests to skip starcoder for now
robertgshaw2-neuralmagic Apr 30, 2024
475ec0a
yapf
robertgshaw2-neuralmagic Apr 30, 2024
e5da6ba
Merge branch 'main' into upstream-sync-2024-04-26
robertgshaw2-neuralmagic Apr 30, 2024
16 changes: 15 additions & 1 deletion .buildkite/run-amd-test.sh
@@ -5,6 +5,19 @@ set -ex
# Print ROCm version
rocminfo


echo "reset" > /opt/amdgpu/etc/gpu_state

while true; do
sleep 3
if grep -q clean /opt/amdgpu/etc/gpu_state; then
echo "GPUs state is \"clean\""
break
fi
done



# Try building the docker image
docker build -t rocm -f Dockerfile.rocm .

@@ -14,7 +27,8 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image
docker run --device /dev/kfd --device /dev/dri --network host --name rocm rocm python3 -m vllm.entrypoints.api_server &
export HIP_VISIBLE_DEVICES=1
docker run --device /dev/kfd --device /dev/dri --network host -e HIP_VISIBLE_DEVICES --name rocm rocm python3 -m vllm.entrypoints.api_server &

# Wait for the server to start
wait_for_server_to_start() {
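The reset block added above waits for an external agent to flip `/opt/amdgpu/etc/gpu_state` to `clean` before the build proceeds. A minimal, self-contained sketch of that polling pattern, with a temp file and a background writer standing in for the GPU reset agent (the helper name and file paths here are illustrative, not part of the CI script):

```shell
# Generalized form of the gpu_state wait loop: poll a state file until it
# contains the expected marker. The background writer simulates the external
# reset agent assumed by the CI script.
wait_for_marker() {
  state_file=$1
  marker=$2
  while true; do
    sleep 1
    if grep -q "$marker" "$state_file"; then
      echo "state is \"$marker\""
      break
    fi
  done
}

state=$(mktemp)
echo "reset" > "$state"
( sleep 2; echo "clean" > "$state" ) &   # stand-in for the reset agent
wait_for_marker "$state" clean
```

Unlike the CI version, a production variant would likely add a timeout so a stuck reset agent cannot hang the job forever.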
37 changes: 37 additions & 0 deletions .buildkite/run-neuron-test.sh
@@ -0,0 +1,37 @@
# This script builds the Neuron docker image and runs the API server inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -e

# Try building the docker image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
docker build -t neuron -f Dockerfile.neuron .

# Setup cleanup
remove_docker_container() { docker rm -f neuron || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image
docker run --device=/dev/neuron0 --device=/dev/neuron1 --network host --name neuron neuron python3 -m vllm.entrypoints.api_server \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --max-num-seqs 8 --max-model-len 128 --block-size 128 --device neuron --tensor-parallel-size 2 &

# Wait for the server to start
wait_for_server_to_start() {
timeout=300
counter=0

while [ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:8000/health)" != "200" ]; do
sleep 1
counter=$((counter + 1))
if [ $counter -ge $timeout ]; then
echo "Timeout after $timeout seconds"
break
fi
done
}
wait_for_server_to_start

# Test a simple prompt
curl -X POST -H "Content-Type: application/json" \
localhost:8000/generate \
-d '{"prompt": "San Francisco is a"}'
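The `wait_for_server_to_start` loop above is a generic poll-with-timeout. A hedged sketch factoring it into a reusable helper (the function name and interface are this sketch's own, not part of the script):

```shell
# Poll a command once per second until it succeeds or `timeout` seconds
# elapse. The CI script inlines this pattern with
# curl -s -o /dev/null -w '%{http_code}' localhost:8000/health.
poll_until() {
  timeout=$1; shift
  counter=0
  until "$@"; do
    sleep 1
    counter=$((counter + 1))
    if [ "$counter" -ge "$timeout" ]; then
      echo "Timeout after $timeout seconds"
      return 1
    fi
  done
  return 0
}

poll_until 5 true && echo "server is up"
```

Against a real server this might be invoked as `poll_until 300 curl -sf localhost:8000/health`.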
15 changes: 13 additions & 2 deletions .buildkite/test-pipeline.yaml
@@ -12,7 +12,11 @@ steps:
command: pytest -v -s async_engine

- label: Basic Correctness Test
command: pytest -v -s basic_correctness
commands:
- VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s basic_correctness/test_basic_correctness.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_basic_correctness.py
- VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s basic_correctness/test_chunked_prefill.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_chunked_prefill.py

- label: Core Test
command: pytest -v -s core
@@ -27,13 +31,14 @@ steps:
num_gpus: 2 # only support 1 or 2 for now.
commands:
- pytest -v -s test_pynccl.py
- pytest -v -s test_pynccl_library.py
- TEST_DIST_MODEL=facebook/opt-125m pytest -v -s test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m pytest -v -s test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s test_chunked_prefill_distributed.py

- label: Engine Test
command: pytest -v -s engine tokenization test_sequence.py test_config.py
command: pytest -v -s engine tokenization test_sequence.py test_config.py test_logger.py

- label: Entrypoints Test
commands:
@@ -85,9 +90,15 @@ steps:
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
parallelism: 4

- label: Tensorizer Test
command: apt-get install curl libsodium23 && pytest -v -s tensorizer_loader

- label: Metrics Test
command: pytest -v -s metrics

- label: Quantization Test
command: pytest -v -s quantization

- label: Benchmarks
working_dir: "/vllm-workspace/.buildkite"
commands:
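The Basic Correctness entries above run the same test files once per attention backend via the `VLLM_ATTENTION_BACKEND` variable. A sketch of that matrix as a loop; it prints the commands instead of invoking pytest so it stays self-contained (the helper name is illustrative):

```shell
# Expand the backend x test-file matrix from the pipeline into one command
# line per backend. Printing rather than executing avoids requiring a vLLM
# checkout.
backend_matrix() {
  for backend in XFORMERS FLASH_ATTN; do
    echo "VLLM_ATTENTION_BACKEND=$backend pytest -v -s $1"
  done
}

backend_matrix basic_correctness/test_basic_correctness.py
backend_matrix basic_correctness/test_chunked_prefill.py
```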
20 changes: 13 additions & 7 deletions .buildkite/test-template.j2
@@ -3,13 +3,6 @@
{% set default_working_dir = "/vllm-workspace/tests" %}

steps:
- label: "AMD Test"
agents:
queue: amd
command: bash .buildkite/run-amd-test.sh

- label: "CPU Test"
command: bash .buildkite/run-cpu-test.sh

- label: ":docker: build image"
commands:
@@ -23,6 +16,19 @@ steps:
limit: 5
- wait

- label: "AMD Test"
agents:
queue: amd
command: bash .buildkite/run-amd-test.sh

- label: "Neuron Test"
agents:
queue: neuron
command: bash .buildkite/run-neuron-test.sh

- label: "CPU Test"
command: bash .buildkite/run-cpu-test.sh

{% for step in steps %}
- label: "{{ step.label }}"
agents:
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/200-installation.yml
@@ -18,6 +18,7 @@ body:
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_env.py
```
It is recommended to download and run the latest version of the script, since vLLM frequently updates the diagnostic information it collects in order to respond to issues quickly and accurately.
value: |
```text
The output of `python collect_env.py`
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/300-usage.yml
@@ -18,6 +18,7 @@ body:
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_env.py
```
It is recommended to download and run the latest version of the script, since vLLM frequently updates the diagnostic information it collects in order to respond to issues quickly and accurately.
value: |
```text
The output of `python collect_env.py`
3 changes: 3 additions & 0 deletions .github/ISSUE_TEMPLATE/400-bug report.yml
@@ -18,6 +18,7 @@ body:
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_env.py
```
It is recommended to download and run the latest version of the script, since vLLM frequently updates the diagnostic information it collects in order to respond to issues quickly and accurately.
value: |
```text
The output of `python collect_env.py`
@@ -57,6 +58,8 @@ body:
If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.

Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.

If you experience crashes or hangs, it is helpful to run vLLM with `export VLLM_TRACE_FUNCTION=1`, which records every function call in vLLM. Inspect the resulting log files to identify which function crashes or hangs.
placeholder: |
A clear and concise description of what the bug is.

1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/700-performance discussion.yml
@@ -39,6 +39,7 @@ body:
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_env.py
```
It is recommended to download and run the latest version of the script, since vLLM frequently updates the diagnostic information it collects in order to respond to issues quickly and accurately.
value: |
```text
The output of `python collect_env.py`
4 changes: 2 additions & 2 deletions .github/scripts/run-tests
@@ -113,8 +113,8 @@ do
# need to be run with specific options
if [[ "${TEST}" == *"kernels"* || "${TEST}" == *"samplers"* ]]; then
CUDA_VISIBLE_DEVICES=0,1 pytest ${CC_PYTEST_FLAGS} --junitxml=${RESULT_XML} ${TEST} || LOCAL_SUCCESS=$?
elif [[ "${TEST}" == *"test_basic_distributed_correctness"* ]]; then
CUDA_VISIBLE_DEVICES=0,1 TEST_DIST_MODEL=facebook/opt-125m pytest ${CC_PYTEST_FLAGS} --junitxml=${RESULT_XML} ${TEST} || LOCAL_SUCCESS=$?
elif [[ "${TEST}" == *"distributed"* ]]; then
CUDA_VISIBLE_DEVICES=0,1 pytest ${CC_PYTEST_FLAGS} --junitxml=${RESULT_XML} ${TEST} || LOCAL_SUCCESS=$?
elif [[ "${TEST}" == *"test_models_logprobs"* ]]; then
pytest --forked ${CC_PYTEST_FLAGS} --junitxml=${RESULT_XML} ${TEST} || LOCAL_SUCCESS=$?
else
50 changes: 50 additions & 0 deletions .github/workflows/mypy.yaml
@@ -0,0 +1,50 @@
name: mypy

on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
pull_request:
branches:
- main

jobs:
ruff:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install mypy==1.9.0
pip install types-setuptools
pip install types-PyYAML
pip install types-requests
pip install types-setuptools
- name: Mypy
run: |
mypy vllm/attention --config-file pyproject.toml
mypy vllm/distributed --config-file pyproject.toml
mypy vllm/entrypoints --config-file pyproject.toml
mypy vllm/executor --config-file pyproject.toml
mypy vllm/usage --config-file pyproject.toml
mypy vllm/*.py --config-file pyproject.toml
mypy vllm/transformers_utils --config-file pyproject.toml
mypy vllm/engine --config-file pyproject.toml
mypy vllm/worker --config-file pyproject.toml
mypy vllm/spec_decode --config-file pyproject.toml
mypy vllm/lora --config-file pyproject.toml

# TODO(sang): Fix nested dir
mypy vllm/model_executor/*.py --config-file pyproject.toml
mypy vllm/core/*.py --follow-imports=skip --config-file pyproject.toml

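The workflow above checks each package individually because nested directories are not yet fully typed (per its TODO). A sketch of reproducing the same pass locally; it emits the commands rather than running them, so it requires neither mypy nor a vllm checkout (the helper name is this sketch's own):

```shell
# Emit the per-package mypy commands from the new workflow. Pipe to `sh -e`
# inside a vllm checkout (with mypy==1.9.0 installed) to actually run them.
mypy_commands() {
  for pkg in attention distributed entrypoints executor usage \
             transformers_utils engine worker spec_decode lora; do
    echo "mypy vllm/$pkg --config-file pyproject.toml"
  done
}

mypy_commands
```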
3 changes: 3 additions & 0 deletions .github/workflows/publish.yml
@@ -56,6 +56,9 @@ jobs:
- name: Checkout
uses: actions/checkout@v3

- name: Setup ccache
uses: hendrikmuhs/ccache-action@v1.2

- name: Set up Linux Env
if: ${{ runner.os == 'Linux' }}
run: |
2 changes: 1 addition & 1 deletion .github/workflows/ruff.yml
@@ -15,7 +15,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10"]
python-version: ["3.8", "3.9", "3.10", "3.11"]
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
2 changes: 1 addition & 1 deletion .github/workflows/yapf.yml
@@ -14,7 +14,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10"]
python-version: ["3.8", "3.9", "3.10", "3.11"]
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
2 changes: 2 additions & 0 deletions .gitignore
@@ -72,6 +72,8 @@ instance/

# Sphinx documentation
docs/_build/
docs/source/getting_started/examples/*.rst
!**/*.template.rst

# PyBuilder
.pybuilder/
14 changes: 2 additions & 12 deletions CMakeLists.txt
@@ -167,12 +167,14 @@ set(VLLM_EXT_SRC
"csrc/layernorm_kernels.cu"
"csrc/quantization/squeezellm/quant_cuda_kernel.cu"
"csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/fp8/fp8_cuda_kernels.cu"
"csrc/cuda_utils_kernels.cu"
"csrc/moe_align_block_size_kernels.cu"
"csrc/pybind.cpp")

if(VLLM_GPU_LANG STREQUAL "CUDA")
list(APPEND VLLM_EXT_SRC
"csrc/quantization/aqlm/gemm_kernels.cu"
"csrc/quantization/awq/gemm_kernels.cu"
"csrc/quantization/marlin/marlin_cuda_kernel.cu"
"csrc/custom_all_reduce.cu")
@@ -210,23 +212,11 @@ define_gpu_extension_target(

set(VLLM_PUNICA_EXT_SRC
"csrc/punica/bgmv/bgmv_bf16_bf16_bf16.cu"
"csrc/punica/bgmv/bgmv_bf16_bf16_fp16.cu"
"csrc/punica/bgmv/bgmv_bf16_fp16_bf16.cu"
"csrc/punica/bgmv/bgmv_bf16_fp16_fp16.cu"
"csrc/punica/bgmv/bgmv_bf16_fp32_bf16.cu"
"csrc/punica/bgmv/bgmv_bf16_fp32_fp16.cu"
"csrc/punica/bgmv/bgmv_fp16_bf16_bf16.cu"
"csrc/punica/bgmv/bgmv_fp16_bf16_fp16.cu"
"csrc/punica/bgmv/bgmv_fp16_fp16_bf16.cu"
"csrc/punica/bgmv/bgmv_fp16_fp16_fp16.cu"
"csrc/punica/bgmv/bgmv_fp16_fp32_bf16.cu"
"csrc/punica/bgmv/bgmv_fp16_fp32_fp16.cu"
"csrc/punica/bgmv/bgmv_fp32_bf16_bf16.cu"
"csrc/punica/bgmv/bgmv_fp32_bf16_fp16.cu"
"csrc/punica/bgmv/bgmv_fp32_fp16_bf16.cu"
"csrc/punica/bgmv/bgmv_fp32_fp16_fp16.cu"
"csrc/punica/bgmv/bgmv_fp32_fp32_bf16.cu"
"csrc/punica/bgmv/bgmv_fp32_fp32_fp16.cu"
"csrc/punica/punica_ops.cc")

#
36 changes: 36 additions & 0 deletions Dockerfile.neuron
@@ -0,0 +1,36 @@
# default base image
ARG BASE_IMAGE="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuronx:2.1.1-neuronx-py310-sdk2.17.0-ubuntu20.04"

FROM $BASE_IMAGE

RUN echo "Base image is $BASE_IMAGE"

# Install some basic utilities
RUN apt-get update && apt-get install python3 python3-pip -y

### Mount Point ###
# When launching the container, mount the code directory to /app
ARG APP_MOUNT=/app
VOLUME [ ${APP_MOUNT} ]
WORKDIR ${APP_MOUNT}

RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install --pre neuronx-cc==2.12.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U

COPY ./vllm /app/vllm/vllm
COPY ./setup.py /app/vllm/setup.py
COPY ./requirements-common.txt /app/vllm/requirements-common.txt
COPY ./requirements-neuron.txt /app/vllm/requirements-neuron.txt

RUN cd /app/vllm \
&& python3 -m pip install -U -r requirements-neuron.txt

ENV VLLM_BUILD_WITH_NEURON 1
RUN cd /app/vllm \
&& pip install -e . \
&& cd ..

CMD ["/bin/bash"]
5 changes: 1 addition & 4 deletions Dockerfile.rocm
@@ -14,7 +14,7 @@ RUN echo "Base image is $BASE_IMAGE"
ARG FA_GFX_ARCHS="gfx90a;gfx942"
RUN echo "FA_GFX_ARCHS is $FA_GFX_ARCHS"

ARG FA_BRANCH="3d2b6f5"
ARG FA_BRANCH="ae7928c"
RUN echo "FA_BRANCH is $FA_BRANCH"

# whether to build flash-attention
@@ -92,13 +92,10 @@ RUN if [ "$BUILD_TRITON" = "1" ]; then \
COPY ./ /app/vllm

RUN python3 -m pip install --upgrade pip numba
RUN python3 -m pip install xformers==0.0.23 --no-deps

RUN cd /app \
&& cd vllm \
&& pip install -U -r requirements-rocm.txt \
&& if [ "$BUILD_FA" = "1" ]; then \
bash patch_xformers.rocm.sh; fi \
&& patch /opt/rocm/include/hip/amd_detail/amd_hip_bf16.h /app/vllm/rocm_patch/rocm_bf16.patch \
&& python3 setup.py install \
&& cd ..