
Rs/bump main to v0.3.2 #38

Merged
merged 118 commits into main from rs/bump-main-to-v0.3.2 on Feb 23, 2024

Conversation

robertgshaw2-neuralmagic (Collaborator)

No description provided.

hongxiayang and others added 30 commits January 26, 2024 12:41
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: zhaoyang-star <zhao.yang16@zte.com.cn>
Co-authored-by: roy <jasonailu87@gmail.com>
Co-authored-by: chen shen <scv119@gmail.com>
…uld respect prefix_len (vllm-project#2688)

Signed-off-by: Tao He <sighingnow@gmail.com>
mgoin and others added 11 commits February 22, 2024 15:05
SUMMARY
* add callable seed workflow for initial boundary testing

Co-authored-by: marcella-found <marcella.found@gmail.com>
A warning will be printed out if this case is triggered:
```
WARNING 02-20 22:21:27 sparse_w16a16.py:32] Unstructured sparse kernels are not optimized for NVIDIA SM < 8.0. Naive decompress kernels will be used and can be slower than dense models
```
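
For context, whether a GPU falls below SM 8.0 can be checked from PyTorch; this is an illustrative sketch (assuming PyTorch with a CUDA device is available), not code from this PR:
```python
import torch

# Compute capability as (major, minor); a T4 reports (7, 5), i.e. SM 7.5,
# so major < 8 means the naive decompress fallback above will be used.
major, minor = torch.cuda.get_device_capability(0)
if major < 8:
    print("SM < 8.0: unstructured sparse kernels fall back to naive decompress")
```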

Works on a T4 with:
```python
from vllm import LLM, SamplingParams

model = LLM(
    "nm-testing/opt-125m-pruned2.4", 
    sparsity="sparse_w16a16",
    enforce_eager=True,
    dtype="float16",
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
outputs[0].outputs[0].text
```

Test within colab:
https://colab.research.google.com/drive/15xRvWX5gNaTb00BcaXhxwMm6yxavIKGN?usp=sharing
Add initial benchmark workflow

---------

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
SUMMARY:
* initial set of "actions with a little a" that are the building blocks
for an eventual CI system
* "build test" workflow
* "remote push" workflow on `a10g`
* update some requirement files to have packages listed in alphabetical
order

NOTE: this PR is still somewhat nebulous, as I'm still working through
building and testing "neuralmagic-vllm" in our automation environment.

TEST:
Currently, I'm working through various workflow components, i.e.
"actions with a little a". The bits making up the actions in this PR
have been constructed from my notes along the way.

We can do a "complete" run that includes linting, building, installing,
and running tests.

GHA link ...
https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7975058564
`testmo` ... https://neuralmagic.testmo.net/automation/runs/view/8097

Latest GHA link ...
https://github.com/neuralmagic/neuralmagic-vllm/actions/runs/7992489982

---------

Co-authored-by: andy-neuma <andy@neuralmagic.com>
Tested by making sure magic_wand was uninstalled and this code for a
dense model runs fine:
```python
from vllm import LLM, SamplingParams
model = LLM("nm-testing/opt-125m-pruned2.4", enforce_eager=True)
```

Then testing with a sparse model run:
```python
from vllm import LLM, SamplingParams
model = LLM("nm-testing/opt-125m-pruned2.4", sparsity="sparse_w16a16", enforce_eager=True)
```
output:
```
...
  File "/home/michael/code/neuralmagic-vllm/vllm/model_executor/weight_utils.py", line 93, in get_sparse_config
    from vllm.model_executor.layers.sparsity import get_sparsity_config
  File "/home/michael/code/neuralmagic-vllm/vllm/model_executor/layers/sparsity/__init__.py", line 6, in <module>
    raise ValueError(
ValueError: magic_wand is not available and required for sparsity support. Please install it with `pip install magic_wand`
```
robertgshaw2-neuralmagic marked this pull request as ready for review February 22, 2024 15:23
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
SUMMARY
* update `TORCH_CUDA_ARCH_LIST` to match `magic_wand`
* update "test vllm" action to run tests serially
* add helper script to find *.py tests, run them serially, and output
JUnit-formatted XML (a sketch of the idea is included below)

TEST
working through changes manually on a debug instance
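
The helper script itself is not reproduced in this description; a minimal sketch of the idea (the `tests/` and `test-results/` paths and the reliance on `pytest` are assumptions) could look like:
```python
import subprocess
import sys
from pathlib import Path

# Find test_*.py files and run each one serially with pytest,
# writing one JUnit-formatted XML report per file.
results_dir = Path("test-results")
results_dir.mkdir(parents=True, exist_ok=True)

for test_file in sorted(Path("tests").rglob("test_*.py")):
    report = results_dir / f"{test_file.stem}.xml"
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", str(test_file), f"--junitxml={report}"]
    )
    # pytest exit code 5 means "no tests collected"; treat it as non-fatal here.
    if proc.returncode not in (0, 5):
        sys.exit(proc.returncode)
```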

---------

Co-authored-by: andy-neuma <andy@neuralmagic.com>
mgoin and others added 4 commits February 23, 2024 10:46
Tested by checking the help message of the OpenAI server:
```
python -m vllm.entrypoints.openai.api_server --help
```

Before:
```
  --sparsity {sparse_w16a16,None}, -s {sparse_w16a16,None}
                        Method used to compress sparse weights. If None, we first check the `sparsity_config`
                        attribute in the model config file. If that is None we assume the model weights are dense
```

After:
```
  --sparsity {None,sparse_w16a16,semi_structured_sparse_w16a16}, -s {None,sparse_w16a16,semi_structured_sparse_w16a16}
                        Method used to compress sparse weights. If None, we first check the `sparsity_config`
                        attribute in the model config file. If that is None we assume the model weights are dense
```
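
For example, the new choice can be passed when launching the server (model name reused from the snippets above; this invocation is illustrative and not part of the PR):
```
python -m vllm.entrypoints.openai.api_server \
    --model nm-testing/opt-125m-pruned2.4 \
    --sparsity sparse_w16a16
```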
SUMMARY:
* "remote push" job for multi-gpu runner.
* "remote push" job for single gpu runner.
* patches for re-initialization of "ray": other places in `vllm` already
pass `ignore_reinit_error=True`; it looks like a couple of places were
simply missed (see the sketch below the list).
* patch "find" command to only find *.py files starting with "test_".
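
A minimal sketch of the Ray behavior being patched (standalone example, not the patched `vllm` code):
```python
import ray

# First initialization of Ray in this process.
ray.init(ignore_reinit_error=True)

# Calling ray.init() again in the same process would normally raise an error;
# with ignore_reinit_error=True the second call is ignored instead.
ray.init(ignore_reinit_error=True)
```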


TEST PLAN:
runs on remote push

---------

Co-authored-by: andy-neuma <andy@neuralmagic.com>
andy-neuma (Member) left a comment


cool. :)

andy-neuma and others added 2 commits February 23, 2024 17:21
SUMMARY
* `yapf` format a couple of test files

TEST PLAN:
ran `yapf` in-place locally to get the files updated.
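
For reference, the in-place invocation looks like the following (the file paths are placeholders, since the description does not name the files):
```
# placeholder test files
yapf --in-place tests/test_example_a.py tests/test_example_b.py
```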
andy-neuma merged commit fdb3cbd into main Feb 23, 2024
2 checks passed
andy-neuma deleted the rs/bump-main-to-v0.3.2 branch February 23, 2024 22:28