This repository has been archived by the owner on Oct 11, 2024. It is now read-only.
forked from vllm-project/vllm
Conversation
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: zhaoyang-star <zhao.yang16@zte.com.cn>
Co-authored-by: roy <jasonailu87@gmail.com>
Co-authored-by: chen shen <scv119@gmail.com>
…e-ray` (vllm-project#2664)
* fix: engine-useray complaint
* fix: typo
…uld respect prefix_len (vllm-project#2688)

Signed-off-by: Tao He <sighingnow@gmail.com>
SUMMARY
* add callable seed workflow for initial boundary testing

Co-authored-by: marcella-found <marcella.found@gmail.com>
A warning will be printed out if this case is triggered:
```
WARNING 02-20 22:21:27 sparse_w16a16.py:32] Unstructured sparse kernels are not optimized for NVIDIA SM < 8.0. Naive decompress kernels will be used and can be slower than dense models
```
Works on a T4 with:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/opt-125m-pruned2.4",
    sparsity="sparse_w16a16",
    enforce_eager=True,
    dtype="float16",
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
outputs[0].outputs[0].text
```
Test within Colab: https://colab.research.google.com/drive/15xRvWX5gNaTb00BcaXhxwMm6yxavIKGN?usp=sharing
Add initial benchmark workflow

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
SUMMARY:
* initial set of "actions with a little a" that are the building blocks for the eventual CI system
* "build test" workflow
* "remote push" workflow on `a10g`
* update some requirement files to list packages in alphabetical order

NOTE: this PR is still somewhat nebulous, as I'm still working through building and testing "neuralmagic-vllm" in our automation environment.

TEST: currently, I'm working through various workflow components, i.e. "actions with a little a". The bits making up the actions in this PR were constructed from my notes along the way. We can do a "complete" run that includes linting, building, installing, and running tests.

GHA link: https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7975058564
`testmo`: https://neuralmagic.testmo.net/automation/runs/view/8097
Latest GHA link: https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7992489982

---------

Co-authored-by: andy-neuma <andy@neuralmagic.com>
Tested by making sure magic_wand was uninstalled and that this code for a dense model runs fine:
```python
from vllm import LLM, SamplingParams

model = LLM("nm-testing/opt-125m-pruned2.4", enforce_eager=True)
```
Then testing with a sparse model run:
```python
from vllm import LLM, SamplingParams

model = LLM("nm-testing/opt-125m-pruned2.4", sparsity="sparse_w16a16", enforce_eager=True)
```
output:
```
...
  File "/home/michael/code/neuralmagic-vllm/vllm/model_executor/weight_utils.py", line 93, in get_sparse_config
    from vllm.model_executor.layers.sparsity import get_sparsity_config
  File "/home/michael/code/neuralmagic-vllm/vllm/model_executor/layers/sparsity/__init__.py", line 6, in <module>
    raise ValueError(
ValueError: magic_wand is not available and required for sparsity support. Please install it with `pip install magic_wand`
```
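The error above comes from a guarded optional import: sparsity support probes for `magic_wand` at import time and raises a helpful `ValueError` if it is missing. A minimal sketch of that pattern (not vllm's actual code; `require_optional_dep` is a hypothetical helper name):

```python
import importlib


def require_optional_dep(name: str, install_hint: str):
    """Import an optional dependency, or raise ValueError with an install hint."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # ModuleNotFoundError subclasses ImportError, so a missing
        # package lands here and is re-raised with actionable guidance
        raise ValueError(
            f"{name} is not available and required for sparsity support. "
            f"Please install it with `{install_hint}`")


# e.g. require_optional_dep("magic_wand", "pip install magic_wand")
```

This keeps the dense-model path importable even when the optional kernel package is absent, which matches the dense-model test above succeeding.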
LucasWilkinson force-pushed the `rs/bump-main-to-v0.3.2` branch from `db66ca8` to `acb8615` on February 22, 2024 17:51
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
SUMMARY
* update `TORCH_CUDA_ARCH_LIST` to match `magic_wand`
* update "test vllm" action to run tests serially
* add helper script to find *.py tests, run them serially, and output JUnit-formatted XML

TEST: working through changes manually on a debug instance

---------

Co-authored-by: andy-neuma <andy@neuralmagic.com>
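The helper script's discovery step can be sketched in Python (the function name and layout here are illustrative, not the script itself, which is assumed to shell out to `pytest` per file):

```python
from pathlib import Path


def find_test_files(root: str) -> list[str]:
    # recursively collect only Python files whose names start with "test_",
    # sorted so they can be run one at a time in a stable order
    return sorted(str(p) for p in Path(root).rglob("test_*.py"))


# each discovered file would then be run serially, e.g. with
#   pytest <file> --junitxml=<file>.xml
```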
mgoin approved these changes on Feb 23, 2024
Tested by checking the help message in the OpenAI server:
```
python -m vllm.entrypoints.openai.api_server --help
```
Before:
```
--sparsity {sparse_w16a16,None}, -s {sparse_w16a16,None}
    Method used to compress sparse weights. If None, we first check the `sparsity_config` attribute in the model config file. If that is None we assume the model weights are dense
```
After:
```
--sparsity {None,sparse_w16a16,semi_structured_sparse_w16a16}, -s {None,sparse_w16a16,semi_structured_sparse_w16a16}
    Method used to compress sparse weights. If None, we first check the `sparsity_config` attribute in the model config file. If that is None we assume the model weights are dense
```
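The help text above corresponds to an argparse choice list. A minimal sketch of how such a flag could be declared (illustrative only, not vllm's actual argument parser; the real flag additionally maps the string `None` to a Python `None`):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--sparsity", "-s",
        choices=["None", "sparse_w16a16", "semi_structured_sparse_w16a16"],
        default="None",
        help="Method used to compress sparse weights. If None, we first "
             "check the `sparsity_config` attribute in the model config "
             "file. If that is None we assume the model weights are dense")
    return parser
```

Adding the new value to `choices` is what makes it appear in `--help`, since argparse renders the choice list inline in the usage string.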
SUMMARY:
* "remote push" job for the multi-GPU runner
* "remote push" job for the single-GPU runner
* patches for re-initialization of "ray". Other places in `vllm` already pass `ignore_reinit_error=True`; it looks like a couple of places were simply missed.
* patch the "find" command to only find *.py files starting with "test_"

TEST PLAN: runs on remote push

---------

Co-authored-by: andy-neuma <andy@neuralmagic.com>
andy-neuma approved these changes on Feb 23, 2024
cool. :)
SUMMARY
* `yapf`-format a couple of test files

TEST PLAN: ran `yapf` in-place locally to update the files.