This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

upstream merge sync 2024-03-11 (#108)
SUMMARY:
* upstream merge (sync) up to `657061fdced8a33a60c1b09f5da2525de9da8f03`
* some minor changes related to `ruff` and `yapf`

NOTES: we are now consistently getting out-of-memory errors when running
`tests/models/test_marlin.py`. I've disabled the test and created an
Asana ticket to track the issue.
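
The files shown below do not include the change that disables the test; one common way to do it is a `pytest` skip marker, sketched here with an illustrative test name and reason string (not taken from this commit):

```python
import pytest


# Illustrative sketch only: skip the Marlin model test while the
# out-of-memory failures are investigated (tracked in an Asana ticket).
@pytest.mark.skip(reason="Consistent OOM when running tests/models/test_marlin.py")
def test_marlin_model_outputs():
    ...
```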

TEST PLAN:
runs on remote push

---------

Signed-off-by: Tao He <sighingnow@gmail.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Co-authored-by: Mustafa Eyceoz <maxusmusti@gmail.com>
Co-authored-by: Roy <jasonailu87@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Massimiliano Pronesti <massimiliano.pronesti@gmail.com>
Co-authored-by: 44670 <44670@users.noreply.github.com>
Co-authored-by: zhaoyang-star <zhaoyangstar@foxmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Jared Moore <27744679+jlcmoore@users.noreply.github.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: 张大成 <1345739055@qq.com>
Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com>
Co-authored-by: Jingru <niejingru@hotmail.com>
Co-authored-by: Dylan Hawk <51147702+dylanwhawk@users.noreply.github.com>
Co-authored-by: Tao He <sighingnow@gmail.com>
Co-authored-by: Ganesh Jagadeesan <ganesh.jcs@gmail.com>
Co-authored-by: Allen.Dou <allen.dou@hotmail.com>
Co-authored-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: CHU Tianxiang <tianxiang.ctx@alibaba-inc.com>
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
Co-authored-by: Seonghyeon <seonghyeon.drew@gmail.com>
Co-authored-by: Billy Cao <aliencaocao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: felixzhu555 <79335195+felixzhu555@users.noreply.github.com>
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Sherry <503147114@qq.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Huarong <huohuarong@gmail.com>
Co-authored-by: huohuarong <huohuarong@zuoshouyisheng.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: cloudhan <cloudhan@outlook.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Jason Cox <jason@jasonacox.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: TianYu GUO <guoty9@mail2.sysu.edu.cn>
Co-authored-by: Jialun Lyu <43287111+pian13131@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
Co-authored-by: Chen Wang <Chen.Wang1@ibm.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: lcskrishna <lollachaitanya@gmail.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Chujie Zheng <chujiezhengchn@gmail.com>
Co-authored-by: TechxGenus <jianghao0728@mail.ustc.edu.cn>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: whyiug <whyiug@hotmail.com>
Co-authored-by: Terry <149540247+tterrysun@users.noreply.github.com>
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>
Co-authored-by: andy-neuma <andy@neuralmagic.com>
Showing 126 changed files with 5,123 additions and 813 deletions.
8 changes: 7 additions & 1 deletion .buildkite/test-pipeline.yaml
@@ -13,6 +13,9 @@ steps:

- label: Basic Correctness Test
command: pytest -v -s --forked basic_correctness

- label: Core Test
command: pytest -v -s core

- label: Distributed Comm Ops Test
command: pytest -v -s --forked test_comm_ops.py
@@ -25,7 +28,7 @@ steps:
num_gpus: 2 # only support 1 or 2 for now.

- label: Engine Test
command: pytest -v -s engine
command: pytest -v -s engine test_sequence.py

- label: Entrypoints Test
command: pytest -v -s entrypoints
Expand All @@ -49,6 +52,9 @@ steps:
- label: Worker Test
command: pytest -v -s worker

- label: Speculative decoding tests
command: pytest -v -s spec_decode

- label: LoRA Test
command: pytest -v -s lora --forked

2 changes: 1 addition & 1 deletion .github/workflows/remote-push.yml
@@ -21,7 +21,7 @@ jobs:
uses: ./.github/workflows/build-test.yml
with:
label: aws-avx2-192G-4-a10g-96G
timeout: 180
timeout: 240
gitref: '${{ github.ref }}'
Gi_per_thread: 4
python: ${{ matrix.python }}
30 changes: 25 additions & 5 deletions Dockerfile.rocm
@@ -23,6 +23,9 @@ RUN echo "FA_BRANCH is $FA_BRANCH"
# In that case, we need to use the python reference attention implementation in vllm
ARG BUILD_FA="1"

# whether to build cupy on rocm
ARG BUILD_CUPY="1"

# Install some basic utilities
RUN apt-get update && apt-get install python3 python3-pip -y

@@ -70,16 +73,33 @@ RUN if [ "$BUILD_FA" = "1" ]; then \
&& cd ..; \
fi

COPY ./ /app/vllm

RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install xformers==0.0.23 --no-deps

# Error related to odd state for numpy 1.20.3 where there is no METADATA etc, but an extra LICENSES_bundled.txt.
# Manually removed it so that later steps of numpy upgrade can continue
RUN if [ "$BASE_IMAGE" = "rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1" ]; then \
rm -rf /opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy-1.20.3.dist-info/; fi

# build cupy
RUN if [ "$BUILD_CUPY" = "1" ]; then \
mkdir -p libs \
&& cd libs \
&& git clone -b hipgraph_enablement --recursive https://github.com/ROCm/cupy.git \
&& cd cupy \
&& pip install mpi4py-mpich \
&& pip install scipy==1.9.3 \
&& pip install cython==0.29.* \
&& env CC=$MPI_HOME/bin/mpicc python -m pip install mpi4py \
&& export CUPY_INSTALL_USE_HIP=1 \
&& export ROCM_HOME=/opt/rocm \
&& export HCC_AMDGPU_TARGET="gfx90a,gfx942,gfx1100" \
&& pip install . \
&& cd ..; \
fi

COPY ./ /app/vllm

RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install xformers==0.0.23 --no-deps

RUN cd /app \
&& cd vllm \
&& pip install -U -r requirements-rocm.txt \
4 changes: 2 additions & 2 deletions README.md
@@ -27,7 +27,7 @@ pip install -e .

## Quickstart

Neural Magic maintains a variety of sparse models on our Hugging Face organization profiles, [neuralmagic](https://huggingface.co/neuralmagic) and [nm-testing](https://huggingface.co/nm-testing).
Neural Magic maintains a variety of sparse models on our Hugging Face organization profiles, [neuralmagic](https://huggingface.co/neuralmagic) and [nm-testing](https://huggingface.co/nm-testing).

A collection of ready-to-use SparseGPT and GPTQ models in inference optimized marlin format are [available on Hugging Face](https://huggingface.co/collections/neuralmagic/compressed-llms-for-nm-vllm-65e73e3d51d3200e34b77431)

@@ -63,7 +63,7 @@ For a quick demonstration, here's how to run a small [50% sparse llama2-110M](ht
from vllm import LLM, SamplingParams

model = LLM(
"neuralmagic/llama2.c-stories110M-pruned50",
"neuralmagic/llama2.c-stories110M-pruned50",
sparsity="sparse_w16a16", # If left off, model will be loaded as dense
)

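The README hunk above is truncated in this view; a minimal end-to-end version of that quickstart, assuming the usual `SamplingParams`/`LLM.generate` API and an illustrative prompt, would look roughly like this:

```python
from vllm import LLM, SamplingParams

# Sketch of the quickstart implied by the README hunk above; the prompt and
# sampling settings are illustrative, not taken from this commit.
model = LLM(
    "neuralmagic/llama2.c-stories110M-pruned50",
    sparsity="sparse_w16a16",  # if omitted, the model is loaded as dense
)

sampling_params = SamplingParams(max_tokens=100, temperature=0.8)
outputs = model.generate("Once upon a time, ", sampling_params)
print(outputs[0].outputs[0].text)
```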
70 changes: 70 additions & 0 deletions benchmarks/backend_request_func.py
@@ -277,10 +277,80 @@ async def async_request_openai_completions(
return output


async def async_request_openai_chat_completions(
request_func_input: RequestFuncInput,
pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
api_url = request_func_input.api_url
assert api_url.endswith(
"v1/chat/completions"
), "OpenAI Chat API URL must end with 'v1/chat/completions'."

async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
assert not request_func_input.use_beam_search
payload = {
"model": request_func_input.model,
"messages": [
{
"role": "user",
"content": request_func_input.prompt,
},
],
"temperature": 0.0,
"max_tokens": request_func_input.output_len,
"stream": True,
}
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
}

output = RequestFuncOutput()
output.prompt_len = request_func_input.prompt_len

generated_text = ""
ttft = 0
st = time.perf_counter()
try:
async with session.post(url=api_url, json=payload,
headers=headers) as response:
if response.status == 200:
async for chunk in response.content:
if ttft == 0:
ttft = time.perf_counter() - st
output.ttft = ttft

chunk = chunk.strip()
if not chunk:
continue

chunk = chunk.decode("utf-8").lstrip("data: ")
if chunk == "[DONE]":
latency = time.perf_counter() - st
else:
body = json.loads(chunk)
if "content" in body["choices"][0]["delta"]:
generated_text += body["choices"][0]["delta"][
"content"]

output.generated_text = generated_text
output.success = True
output.latency = latency
else:
output.success = False
except (aiohttp.ClientOSError, aiohttp.ServerDisconnectedError):
output.success = False

if pbar:
pbar.update(1)
return output


ASYNC_REQUEST_FUNCS = {
"tgi": async_request_tgi,
"vllm": async_request_vllm,
"deepspeed-mii": async_request_deepspeed_mii,
"openai": async_request_openai_completions,
"openai-chat": async_request_openai_chat_completions,
"tensorrt-llm": async_request_trt_llm,
}
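The new `"openai-chat"` entry is selected by backend name in `benchmark_serving.py`; a rough sketch of driving it directly is below, assuming it runs from the `benchmarks/` directory. The `RequestFuncInput` field names come from the diff above, while the server URL, model name, and prompt are illustrative.

```python
import asyncio

from backend_request_func import ASYNC_REQUEST_FUNCS, RequestFuncInput


async def main():
    # Look up the chat-completions backend added in this commit.
    request_func = ASYNC_REQUEST_FUNCS["openai-chat"]
    request_input = RequestFuncInput(
        prompt="Explain speculative decoding in one sentence.",  # illustrative
        api_url="http://localhost:8000/v1/chat/completions",     # illustrative
        prompt_len=8,
        output_len=64,
        model="meta-llama/Llama-2-7b-chat-hf",                   # illustrative
        use_beam_search=False,
    )
    output = await request_func(request_func_input=request_input)
    print(output.success, output.latency, output.generated_text)


asyncio.run(main())
```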
6 changes: 3 additions & 3 deletions benchmarks/benchmark_serving.py
@@ -12,7 +12,7 @@
On the client side, run:
python benchmarks/benchmark_serving.py \
--backend <backend> \
--tokenizer <your_model> --dataset <target_dataset> \
--model <your_model> --dataset <target_dataset> \
--request-rate <request_rate>
"""
import argparse
@@ -171,10 +171,10 @@ async def benchmark(
else:
raise ValueError(f"Unknown backend: {backend}")

pbar = None if disable_tqdm else tqdm(total=len(input_requests))

print(f"Traffic request rate: {request_rate}")

pbar = None if disable_tqdm else tqdm(total=len(input_requests))

benchmark_start_time = time.perf_counter()
tasks = []
async for request in get_request(input_requests, request_rate):
8 changes: 0 additions & 8 deletions csrc/attention/attention_kernels.cu
@@ -15,9 +15,6 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#ifdef USE_ROCM
#include <hip/hip_runtime.h>
#endif

#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>
@@ -31,11 +28,6 @@

#include <algorithm>

#ifndef USE_ROCM
#define WARP_SIZE 32
#else
#define WARP_SIZE warpSize
#endif
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define DIVIDE_ROUND_UP(a, b) (((a) + (b) - 1) / (b))
10 changes: 10 additions & 0 deletions csrc/cuda_compat.h
@@ -1,5 +1,15 @@
#pragma once

#ifdef USE_ROCM
#include <hip/hip_runtime.h>
#endif

#ifndef USE_ROCM
#define WARP_SIZE 32
#else
#define WARP_SIZE warpSize
#endif

#ifndef USE_ROCM
#define VLLM_LDG(arg) __ldg(arg)
#else
3 changes: 3 additions & 0 deletions csrc/punica/bgmv/bgmv_config.h
@@ -14,13 +14,15 @@ void bgmv_kernel(out_T *__restrict__ Y, const in_T *__restrict__ X,
f(in_T, out_T, W_T, narrow, 128) \
f(in_T, out_T, W_T, narrow, 256) \
f(in_T, out_T, W_T, narrow, 512) \
f(in_T, out_T, W_T, narrow, 768) \
f(in_T, out_T, W_T, narrow, 1024) \
f(in_T, out_T, W_T, narrow, 1280) \
f(in_T, out_T, W_T, narrow, 1728) \
f(in_T, out_T, W_T, narrow, 1792) \
f(in_T, out_T, W_T, narrow, 2048) \
f(in_T, out_T, W_T, narrow, 2560) \
f(in_T, out_T, W_T, narrow, 2752) \
f(in_T, out_T, W_T, narrow, 2816) \
f(in_T, out_T, W_T, narrow, 3072) \
f(in_T, out_T, W_T, narrow, 3456) \
f(in_T, out_T, W_T, narrow, 3584) \
@@ -36,6 +38,7 @@ void bgmv_kernel(out_T *__restrict__ Y, const in_T *__restrict__ X,
f(in_T, out_T, W_T, narrow, 10240) \
f(in_T, out_T, W_T, narrow, 11008) \
f(in_T, out_T, W_T, narrow, 12288) \
f(in_T, out_T, W_T, narrow, 13696) \
f(in_T, out_T, W_T, narrow, 13824) \
f(in_T, out_T, W_T, narrow, 14336) \
f(in_T, out_T, W_T, narrow, 16384) \
6 changes: 3 additions & 3 deletions csrc/reduction_utils.cuh
@@ -24,15 +24,15 @@ namespace vllm {
template<typename T>
__inline__ __device__ T warpReduceSum(T val) {
#pragma unroll
for (int mask = 16; mask > 0; mask >>= 1)
for (int mask = WARP_SIZE/2; mask > 0; mask >>= 1)
val += VLLM_SHFL_XOR_SYNC(val, mask);
return val;
}

/* Calculate the sum of all elements in a block */
template<typename T>
__inline__ __device__ T blockReduceSum(T val) {
static __shared__ T shared[32];
static __shared__ T shared[WARP_SIZE];
int lane = threadIdx.x & 0x1f;
int wid = threadIdx.x >> 5;

@@ -45,7 +45,7 @@ __inline__ __device__ T blockReduceSum(T val) {

// Modify from blockDim.x << 5 to blockDim.x / 32. to prevent
// blockDim.x is not divided by 32
val = (threadIdx.x < (blockDim.x / 32.f)) ? shared[lane] : (T)(0.0f);
val = (threadIdx.x < (blockDim.x / (WARP_SIZE * 1.0f))) ? shared[lane] : (T)(0.0f);
val = warpReduceSum<T>(val);
return val;
}
11 changes: 9 additions & 2 deletions docs/source/conf.py
@@ -72,8 +72,15 @@

# Mock out external dependencies here.
autodoc_mock_imports = [
"torch", "transformers", "psutil", "prometheus_client", "sentencepiece",
"vllm.cuda_utils", "vllm._C"
"torch",
"transformers",
"psutil",
"prometheus_client",
"sentencepiece",
"vllm.cuda_utils",
"vllm._C",
"numpy",
"tqdm",
]

for mock_target in autodoc_mock_imports:
4 changes: 2 additions & 2 deletions docs/source/models/lora.rst
@@ -90,9 +90,9 @@ Requests can specify the LoRA adapter as if it were any other model via the ``mo
processed according to the server-wide LoRA configuration (i.e. in parallel with base model requests, and potentially other
LoRA adapter requests if they were provided and ``max_loras`` is set high enough).

The following is an example request
The following is an example request

.. code-block::bash
.. code-block::bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
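The curl example in this hunk is cut off by the diff view; an equivalent request through the `openai` Python client, with a hypothetical adapter name standing in for whatever LoRA module the server was started with, might look like this:

```python
# Hypothetical sketch: query a vLLM OpenAI-compatible server, selecting a LoRA
# adapter via the `model` field. The adapter name and prompt are placeholders,
# not values taken from this commit.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="my-sql-lora-adapter",  # hypothetical LoRA adapter name
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```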