
Rs/bump main to v0.3.2 #38

Merged
merged 118 commits into main from rs/bump-main-to-v0.3.2 on Feb 23, 2024

Conversation

robertgshaw2-neuralmagic (Collaborator)

No description provided.

hongxiayang and others added 30 commits January 26, 2024 12:41
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: zhaoyang-star <zhao.yang16@zte.com.cn>
Co-authored-by: roy <jasonailu87@gmail.com>
Co-authored-by: chen shen <scv119@gmail.com>
…uld respect prefix_len (vllm-project#2688)

Signed-off-by: Tao He <sighingnow@gmail.com>
mgoin and others added 11 commits February 22, 2024 15:05
SUMMARY
* add callable seed workflow for initial boundary testing

Co-authored-by: marcella-found <marcella.found@gmail.com>
A warning will be printed out if this case is triggered:
```
WARNING 02-20 22:21:27 sparse_w16a16.py:32] Unstructured sparse kernels are not optimized for NVIDIA SM < 8.0. Naive decompress kernels will be used and can be slower than dense models
```
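
For context, whether a GPU falls below SM 8.0 can be checked from PyTorch; this is an illustrative sketch (assuming PyTorch with a CUDA device is available), not code from this PR:
```python
import torch

# Compute capability as (major, minor); a T4 reports (7, 5), i.e. SM 7.5,
# so major < 8 means the naive decompress fallback above will be used.
major, minor = torch.cuda.get_device_capability(0)
if major < 8:
    print("SM < 8.0: unstructured sparse kernels fall back to naive decompress")
```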

Works on a T4 with:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/opt-125m-pruned2.4", 
    sparsity="sparse_w16a16",
    enforce_eager=True,
    dtype="float16",
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
outputs[0].outputs[0].text
```

Test within colab:
https://colab.research.google.com/drive/15xRvWX5gNaTb00BcaXhxwMm6yxavIKGN?usp=sharing
Add initial benchmark workflow

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
SUMMARY:
* initial set of "actions with a little a" that are the building blocks
for an eventual CI system
* "build test" workflow
* "remote push" workflow on `a10g`
* update some requirement files to have packages listed in alphabetical
order

NOTE: this PR is still somewhat nebulous, as I'm still working through
building and testing "neuralmagic-vllm" in our automation environment.

TEST:
Currently, I'm working through various workflow components, i.e.
"actions with a little a". The bits making up the actions in this PR
have been constructed from my notes along the way.

We can do a "complete" run that includes linting, building, installing,
and running tests.

GHA link ...
https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7975058564
`testmo` ... https://neuralmagic.testmo.net/automation/runs/view/8097

Latest GHA link ...
https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7992489982

---------

Co-authored-by: andy-neuma <andy@neuralmagic.com>
Tested by making sure magic_wand was uninstalled and this code for a
dense model runs fine:
```python
from vllm import LLM, SamplingParams
model = LLM("nm-testing/opt-125m-pruned2.4", enforce_eager=True)
```

Then testing with a sparse model run:
```python
from vllm import LLM, SamplingParams
model = LLM("nm-testing/opt-125m-pruned2.4", sparsity="sparse_w16a16", enforce_eager=True)
```
output:
```
...
  File "/home/michael/code/neuralmagic-vllm/vllm/model_executor/weight_utils.py", line 93, in get_sparse_config
    from vllm.model_executor.layers.sparsity import get_sparsity_config
  File "/home/michael/code/neuralmagic-vllm/vllm/model_executor/layers/sparsity/__init__.py", line 6, in <module>
    raise ValueError(
ValueError: magic_wand is not available and required for sparsity support. Please install it with `pip install magic_wand`
```
robertgshaw2-neuralmagic marked this pull request as ready for review February 22, 2024 15:23
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
SUMMARY
* update `TORCH_CUDA_ARCH_LIST` to match `magic_wand`
* update "test vllm" action to run tests serially
* add helper script to find *.py tests, run them serially, and output
JUnit-formatted XML (a sketch of the idea is included below)

TEST
working through changes manually on a debug instance
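
The helper script itself is not reproduced in this description; a minimal sketch of the idea (the `tests/` and `test-results/` paths and the reliance on `pytest` are assumptions) could look like:
```python
import subprocess
import sys
from pathlib import Path

# Find test_*.py files and run each one serially with pytest,
# writing one JUnit-formatted XML report per file.
results_dir = Path("test-results")
results_dir.mkdir(parents=True, exist_ok=True)

for test_file in sorted(Path("tests").rglob("test_*.py")):
    report = results_dir / f"{test_file.stem}.xml"
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", str(test_file), f"--junitxml={report}"]
    )
    # pytest exit code 5 means "no tests collected"; treat it as non-fatal here.
    if proc.returncode not in (0, 5):
        sys.exit(proc.returncode)
```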

---------

Co-authored-by: andy-neuma <andy@neuralmagic.com>
mgoin and others added 4 commits February 23, 2024 10:46
Tested by checking the help message of the OpenAI server:
```
python -m vllm.entrypoints.openai.api_server --help
```

Before:
```
  --sparsity {sparse_w16a16,None}, -s {sparse_w16a16,None}
                        Method used to compress sparse weights. If None, we first check the `sparsity_config`
                        attribute in the model config file. If that is None we assume the model weights are dense
```

After:
```
  --sparsity {None,sparse_w16a16,semi_structured_sparse_w16a16}, -s {None,sparse_w16a16,semi_structured_sparse_w16a16}
                        Method used to compress sparse weights. If None, we first check the `sparsity_config`
                        attribute in the model config file. If that is None we assume the model weights are dense
```
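
For example, the new choice can be passed when launching the server (model name reused from the snippets above; this invocation is illustrative and not part of the PR):
```
python -m vllm.entrypoints.openai.api_server \
    --model nm-testing/opt-125m-pruned2.4 \
    --sparsity sparse_w16a16
```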
SUMMARY:
* "remote push" job for multi-gpu runner.
* "remote push" job for single gpu runner.
* patches for re-initialization of "ray": other places in `vllm` already
pass `ignore_reinit_error=True`; it looks like a couple of places were
simply missed (see the sketch below the list).
* patch "find" command to only find *.py files starting with "test_".
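
A minimal sketch of the Ray behavior being patched (standalone example, not the patched `vllm` code):
```python
import ray

# First initialization of Ray in this process.
ray.init(ignore_reinit_error=True)

# Calling ray.init() again in the same process would normally raise an error;
# with ignore_reinit_error=True the second call is ignored instead.
ray.init(ignore_reinit_error=True)
```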


TEST PLAN:
runs on remote push

---------

Co-authored-by: andy-neuma <andy@neuralmagic.com>
andy-neuma (Member) left a comment


cool. :)

andy-neuma and others added 2 commits February 23, 2024 17:21
SUMMARY
* `yapf` format a couple of test files

TEST PLAN:
ran `yapf` in-place locally to get the files updated.
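
For reference, the in-place invocation looks like the following (the file paths are placeholders, since the description does not name the files):
```
# placeholder test files
yapf --in-place tests/test_example_a.py tests/test_example_b.py
```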
andy-neuma merged commit fdb3cbd into main Feb 23, 2024
2 checks passed
andy-neuma deleted the rs/bump-main-to-v0.3.2 branch February 23, 2024 22:28