Releases: neuralmagic/nm-vllm
v0.5.0
Key Features
This release is based on upstream vllm==v0.5.0.post
What's Changed
- bump up version to 0.5.0 by @dhuangnm in #278
- update publish.yml by @andy-neuma in #280
- fix a minor bug for docker build by @dhuangnm in #281
- update publish.yml by @andy-neuma in #282
- [CI/Build] Verify licenses by @derekk-nm in #272
- strip binaries by @dhuangnm in #283
- only run multi-gpu for python 3.10.12 by @andy-neuma in #284
- add more models, new num_logprobs by @derekk-nm in #285
- upload NIGHTLY assets to GCP by @andy-neuma in #286
- GCP test runners by @andy-neuma in #275
- Add nightly tag by @dhuangnm in #287
- Upstream sync 2024 06 08 by @robertgshaw2-neuralmagic in #288
- [Rel Eng] Update Nightly Workflow To Use Proper Skip List by @robertgshaw2-neuralmagic in #296
- [Rel Eng] Upstream sync 2024 06 11 by @robertgshaw2-neuralmagic in #298
- use nm-pypi service account by @andy-neuma in #300
- default nvcc_threads to 8 in order to reduce build execution time by @derekk-nm in #304
- Upstream sync 2024 06 12 by @robertgshaw2-neuralmagic in #302
- Fix docker image build issue by @dhuangnm in #305
- Remote push refactor by @robertgshaw2-neuralmagic in #297
- Update nm-nightly.yml by @derekk-nm in #308
- Use shared actions by @dbarbuzzi in #309
- enable tests that require C compiler by @andy-neuma in #310
- [ CI ] Fix Failing Test Server Logprobs (tolerance tweak) by @robertgshaw2-neuralmagic in #312
- [ CI ] Fix Failing Magic Wand Test by @robertgshaw2-neuralmagic in #311
- Add githash to nm-vllm by @dhuangnm in #299
- Upstream sync 2024 06 16 by @robertgshaw2-neuralmagic in #307
- [ CI ] skip local_workers_clean_shutdown by @robertgshaw2-neuralmagic in #317
- set PYTHON-3-10 job to gcp by @derekk-nm in #318
- [Rel Eng] Dial In LM Eval Tests Phase 1 by @robertgshaw2-neuralmagic in #289
- revert githash commit by @dhuangnm in #320
- Pruned Readme by @robertgshaw2-neuralmagic in #313
- Force-disable upstream tracking by @dbarbuzzi in #321
- [ README ] Update README.md by @robertgshaw2-neuralmagic in #323
Full Changelog: 0.4.0...0.5.0
v0.4.0
Key Features
This release is based on vllm==0.4.3
What's Changed
- turn off single gpu scenario by @andy-neuma in #88
- Benchmarking : Absolute -> Relative imports by @varun-sundar-rabindranath in #85
- Benchmarking : update Gi_per_thread by @varun-sundar-rabindranath in #90
- Update README.md with sparsity and quantization explainers by @mgoin in #91
- Add notebooks for sparsegpt and marlin compression with nm-vllm by @mgoin in #94
- upstream sync 2024-03-04 by @andy-neuma in #89
- Update README.md by @robertgshaw2-neuralmagic in #96
- Formatting : Fix yapf by @varun-sundar-rabindranath in #101
- Lower unstructured sparsity threshold to 40% by @mgoin in #100
- Benchmarking : Misc updates by @varun-sundar-rabindranath in #95
- upstream merge sync 2024-03-11 by @andy-neuma in #108
- Add lm-eval comparison script by @mgoin in #99
- Benchmarks : Standardize benchmark result store by @varun-sundar-rabindranath in #87
- seed whl centric workflows by @andy-neuma in #116
- Benchmarking : Remote push job by @varun-sundar-rabindranath in #92
- reverted accidental commit to main by @robertgshaw2-neuralmagic in #119
- skipped test for nightly failure by @robertgshaw2-neuralmagic in #120
- Turned back on the Marlin tests by @robertgshaw2-neuralmagic in #121
- Benchmarking : Prepare for GHA benchmark UI by @varun-sundar-rabindranath in #122
- Upstream sync 2024 03 14 by @robertgshaw2-neuralmagic in #127
- Benchmark : Update benchmark configs for Nightly by @varun-sundar-rabindranath in #126
- Benchmark : Modify/Add workflows/actions for github-action-benchmark by @varun-sundar-rabindranath in #123
- Benchmark: fix nightly by @varun-sundar-rabindranath in #131
- Fix nightly - 03/18/2024 by @varun-sundar-rabindranath in #136
- Upstream sync 2024 03 18 by @robertgshaw2-neuralmagic in #134
- Update Dockerfile with extensions support by @mgoin in #107
- Benchmark : Turn-off nightly multi-gpu benchmarks temporarily by @varun-sundar-rabindranath in #130
- Benchmark Fix : Remove special tokens from warmup prompts by @varun-sundar-rabindranath in #140
- Delete .github/pull_request_template.md by @mgoin in #145
- Benchmarking : Update readme by @varun-sundar-rabindranath in #144
- Initial Layerwise Profiler by @LucasWilkinson in #124
- Benchmark Fix : Fix JSON decode error by @varun-sundar-rabindranath in #142
- Upstream sync 2024 03 24 by @robertgshaw2-neuralmagic in #143
- Benchmark : Fix remote push job by @varun-sundar-rabindranath in #129
- Benchmarks : Prune nightly benchmarks by @varun-sundar-rabindranath in #150
- Lock lm-evaluation-harness to commit 262f879 by @mgoin in #151
- Benchmarks : Copy benchmark results to EFS by @varun-sundar-rabindranath in #148
- update readme with nvcc threads option by @varun-sundar-rabindranath in #153
- Generate tarball along with wheel build, and upload both in a package to GH by @dhuangnm in #138
- switch to nightly whl's by @andy-neuma in #154
- whl centric workflow for "remote push" by @andy-neuma in #117
- remove low-workload benchmarks that are flaky by @varun-sundar-rabindranath in #156
- nightly patches by @andy-neuma in #160
- Upstream sync v0.4.0.post1 (merged with `upstream-v0.4.0.post1`) by @mgoin in #157
- Bump version to 0.2 by @mgoin in #165
- rename wheels to manylinux and remove unused action by @dhuangnm in #167
- Update collect_env.py package list by @mgoin in #169
- Add lm-eval full accuracy sweep using GSM8k by @mgoin in #166
- Upstream sync 2024 04 08 by @SageMoore in #173
- Updated logo in README by @rgreenberg1 in #178
- Fix sparsity arg in Engine/ModelArgs by @mgoin in #179
- rm model_executor/layers/attention directory since it's been moved by @tlrmchlsmth in #181
- Upstream sync 2024 04 12 by @andy-neuma in #183
- mm publish workflow by @andy-neuma in #193
- GCP related build workflow updates by @andy-neuma in #196
- switch to GCP based build VM by @andy-neuma in #201
- cleanup venv by @andy-neuma in #217
- Upstream sync 2024 04 26 by @robertgshaw2-neuralmagic in #211
- update workflows to use generated whls by @andy-neuma in #204
- Fix nightly benchmark scripts by @dbarbuzzi in #229
- Add lm-eval correctness test by @dbarbuzzi in #210
- switch to k8s runners by @andy-neuma in #231
- Upstream sync 2024 05 05 by @robertgshaw2-neuralmagic in #224
- Marlin 2:4 Downstream (for v0.3 release) by @robertgshaw2-neuralmagic in #239
- Misc CI/CD updates by @dbarbuzzi in #240
- bump version to 0.3.0 by @dhuangnm in #241
- [Bugfix] Fix marlin 2:4 kernel crash on H100 by @mgoin in #243
- switch runner from aws to gcp for generate whl workflow by @dhuangnm in #242
- Add FP8 and marlin 2:4 tests for lm-eval by @mgoin in #244
- updates for nm-magic-wand, nightly or release by @andy-neuma in #247
- version check patch by @andy-neuma in #251
- increase timeouts by @andy-neuma in #253
- `requirements-dev.txt` and workflow patches by @andy-neuma in #255
- updates for automation (and release) by @andy-neuma in #265
- update install commands by @dhuangnm in #264
- Address py38/39 incompatibilities by @dbarbuzzi in #261
- [CI/Build] Basic server correctness test by @derekk-nm in #237
- bump up version and gate magic-wand version by @dhuangnm in #267
- remove release workflow concurrency limit by @andy-neuma in #270
- [CI/Build] include NOTICE in package dist-info by @derekk-nm in #271
- switch benchmarking and testing jobs to run using "test" label by @andy-neuma in #273
- Handle server startup failure in enter by @dbarbuzzi in #274
- Upstream sync 2024 05 19 by @robertgshaw2-neuralmagic in #249
- Docker image improvements by @dhuangnm in #276
- add latest tag for release docker image by @dhuangnm in #279
New Contributors
- @SageMoore made their first contribution in #173
- @rgreenberg1 made their first contribution in #178
- @derekk-nm made their first contribution in #237
Full Changelog: 0.1.0...v0.4.0
v0.3.0
Key Features
This release is based on vllm==0.4.2
What's Changed
- turn off single gpu scenario by @andy-neuma in #88
- Benchmarking : Absolute -> Relative imports by @varun-sundar-rabindranath in #85
- Benchmarking : update Gi_per_thread by @varun-sundar-rabindranath in #90
- Update README.md with sparsity and quantization explainers by @mgoin in #91
- Add notebooks for sparsegpt and marlin compression with nm-vllm by @mgoin in #94
- upstream sync 2024-03-04 by @andy-neuma in #89
- Update README.md by @robertgshaw2-neuralmagic in #96
- Formatting : Fix yapf by @varun-sundar-rabindranath in #101
- Lower unstructured sparsity threshold to 40% by @mgoin in #100
- Benchmarking : Misc updates by @varun-sundar-rabindranath in #95
- upstream merge sync 2024-03-11 by @andy-neuma in #108
- Add lm-eval comparison script by @mgoin in #99
- Benchmarks : Standardize benchmark result store by @varun-sundar-rabindranath in #87
- seed whl centric workflows by @andy-neuma in #116
- Benchmarking : Remote push job by @varun-sundar-rabindranath in #92
- reverted accidental commit to main by @robertgshaw2-neuralmagic in #119
- skipped test for nightly failure by @robertgshaw2-neuralmagic in #120
- Turned back on the Marlin tests by @robertgshaw2-neuralmagic in #121
- Benchmarking : Prepare for GHA benchmark UI by @varun-sundar-rabindranath in #122
- Upstream sync 2024 03 14 by @robertgshaw2-neuralmagic in #127
- Benchmark : Update benchmark configs for Nightly by @varun-sundar-rabindranath in #126
- Benchmark : Modify/Add workflows/actions for github-action-benchmark by @varun-sundar-rabindranath in #123
- Benchmark: fix nightly by @varun-sundar-rabindranath in #131
- Fix nightly - 03/18/2024 by @varun-sundar-rabindranath in #136
- Upstream sync 2024 03 18 by @robertgshaw2-neuralmagic in #134
- Update Dockerfile with extensions support by @mgoin in #107
- Benchmark : Turn-off nightly multi-gpu benchmarks temporarily by @varun-sundar-rabindranath in #130
- Benchmark Fix : Remove special tokens from warmup prompts by @varun-sundar-rabindranath in #140
- Delete .github/pull_request_template.md by @mgoin in #145
- Benchmarking : Update readme by @varun-sundar-rabindranath in #144
- Initial Layerwise Profiler by @LucasWilkinson in #124
- Benchmark Fix : Fix JSON decode error by @varun-sundar-rabindranath in #142
- Upstream sync 2024 03 24 by @robertgshaw2-neuralmagic in #143
- Benchmark : Fix remote push job by @varun-sundar-rabindranath in #129
- Benchmarks : Prune nightly benchmarks by @varun-sundar-rabindranath in #150
- Lock lm-evaluation-harness to commit 262f879 by @mgoin in #151
- Benchmarks : Copy benchmark results to EFS by @varun-sundar-rabindranath in #148
- update readme with nvcc threads option by @varun-sundar-rabindranath in #153
- Generate tarball along with wheel build, and upload both in a package to GH by @dhuangnm in #138
- switch to nightly whl's by @andy-neuma in #154
- whl centric workflow for "remote push" by @andy-neuma in #117
- remove low-workload benchmarks that are flaky by @varun-sundar-rabindranath in #156
- nightly patches by @andy-neuma in #160
- Upstream sync v0.4.0.post1 (merged with `upstream-v0.4.0.post1`) by @mgoin in #157
- Bump version to 0.2 by @mgoin in #165
- rename wheels to manylinux and remove unused action by @dhuangnm in #167
- Update collect_env.py package list by @mgoin in #169
- Add lm-eval full accuracy sweep using GSM8k by @mgoin in #166
- Upstream sync 2024 04 08 by @SageMoore in #173
- Updated logo in README by @rgreenberg1 in #178
- Fix sparsity arg in Engine/ModelArgs by @mgoin in #179
- rm model_executor/layers/attention directory since it's been moved by @tlrmchlsmth in #181
- Upstream sync 2024 04 12 by @andy-neuma in #183
- mm publish workflow by @andy-neuma in #193
- GCP related build workflow updates by @andy-neuma in #196
- switch to GCP based build VM by @andy-neuma in #201
- cleanup venv by @andy-neuma in #217
- Upstream sync 2024 04 26 by @robertgshaw2-neuralmagic in #211
- update workflows to use generated whls by @andy-neuma in #204
- Fix nightly benchmark scripts by @dbarbuzzi in #229
- Add lm-eval correctness test by @dbarbuzzi in #210
- switch to k8s runners by @andy-neuma in #231
- Upstream sync 2024 05 05 by @robertgshaw2-neuralmagic in #224
- Marlin 2:4 Downstream (for v0.3 release) by @robertgshaw2-neuralmagic in #239
- Misc CI/CD updates by @dbarbuzzi in #240
- bump version to 0.3.0 by @dhuangnm in #241
- [Cherrypick, Bugfix] Fix marlin 2:4 kernel crash on H100 by @mgoin in #245
- [cherry-pick] Update gen-whl.yml by @dhuangnm in #246
- updates for nm-magic-wand, nightly or release (#247) by @andy-neuma in #248
New Contributors
- @SageMoore made their first contribution in #173
- @rgreenberg1 made their first contribution in #178
- @dbarbuzzi made their first contribution in #229
Full Changelog: 0.1.0...0.3.0
v0.2.0
Key Features
This release is based on vllm==0.4.0.post1
- New model architectures supported! `DbrxForCausalLM`, `CohereForCausalLM` (Command-R), `JAISLMHeadModel`, `LlavaForConditionalGeneration` (experimental vision LM), `OrionForCausalLM`, `Qwen2MoeForCausalLM`, `StableLmForCausalLM`, `Starcoder2ForCausalLM`, and `XverseForCausalLM` (see the loading sketch after this list)
- Automated benchmarking
- Code coverage reporting
- lm-evaluation-harness nightly accuracy testing
- Layerwise Profiling for the inference graph (#124)
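As a quick sketch of what loading one of these newly supported architectures looks like (the model ID, prompt, and generation settings below are illustrative and not taken from these release notes):

```python
from vllm import LLM, SamplingParams

# Minimal sketch: a Starcoder2ForCausalLM checkpoint loads through the usual
# LLM entry point, the same as any previously supported architecture.
model = LLM("bigcode/starcoder2-3b", max_model_len=1024)
sampling_params = SamplingParams(max_tokens=64, temperature=0)
outputs = model.generate("def fibonacci(n):", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```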
What's Changed
- turn off single gpu scenario by @andy-neuma in #88
- Benchmarking : Absolute -> Relative imports by @varun-sundar-rabindranath in #85
- Benchmarking : update Gi_per_thread by @varun-sundar-rabindranath in #90
- Update README.md with sparsity and quantization explainers by @mgoin in #91
- Add notebooks for sparsegpt and marlin compression with nm-vllm by @mgoin in #94
- upstream sync 2024-03-04 by @andy-neuma in #89
- Update README.md by @robertgshaw2-neuralmagic in #96
- Formatting : Fix yapf by @varun-sundar-rabindranath in #101
- Lower unstructured sparsity threshold to 40% by @mgoin in #100
- Benchmarking : Misc updates by @varun-sundar-rabindranath in #95
- upstream merge sync 2024-03-11 by @andy-neuma in #108
- Add lm-eval comparison script by @mgoin in #99
- Benchmarks : Standardize benchmark result store by @varun-sundar-rabindranath in #87
- seed whl centric workflows by @andy-neuma in #116
- Benchmarking : Remote push job by @varun-sundar-rabindranath in #92
- reverted accidental commit to main by @robertgshaw2-neuralmagic in #119
- skipped test for nightly failure by @robertgshaw2-neuralmagic in #120
- Turned back on the Marlin tests by @robertgshaw2-neuralmagic in #121
- Benchmarking : Prepare for GHA benchmark UI by @varun-sundar-rabindranath in #122
- Upstream sync 2024 03 14 by @robertgshaw2-neuralmagic in #127
- Benchmark : Update benchmark configs for Nightly by @varun-sundar-rabindranath in #126
- Benchmark : Modify/Add workflows/actions for github-action-benchmark by @varun-sundar-rabindranath in #123
- Benchmark: fix nightly by @varun-sundar-rabindranath in #131
- Fix nightly - 03/18/2024 by @varun-sundar-rabindranath in #136
- Upstream sync 2024 03 18 by @robertgshaw2-neuralmagic in #134
- Update Dockerfile with extensions support by @mgoin in #107
- Benchmark : Turn-off nightly multi-gpu benchmarks temporarily by @varun-sundar-rabindranath in #130
- Benchmark Fix : Remove special tokens from warmup prompts by @varun-sundar-rabindranath in #140
- Delete .github/pull_request_template.md by @mgoin in #145
- Benchmarking : Update readme by @varun-sundar-rabindranath in #144
- Initial Layerwise Profiler by @LucasWilkinson in #124
- Benchmark Fix : Fix JSON decode error by @varun-sundar-rabindranath in #142
- Upstream sync 2024 03 24 by @robertgshaw2-neuralmagic in #143
- Benchmark : Fix remote push job by @varun-sundar-rabindranath in #129
- Benchmarks : Prune nightly benchmarks by @varun-sundar-rabindranath in #150
- Lock lm-evaluation-harness to commit 262f879 by @mgoin in #151
- Benchmarks : Copy benchmark results to EFS by @varun-sundar-rabindranath in #148
- update readme with nvcc threads option by @varun-sundar-rabindranath in #153
- Generate tarball along with wheel build, and upload both in a package to GH by @dhuangnm in #138
- switch to nightly whl's by @andy-neuma in #154
- whl centric workflow for "remote push" by @andy-neuma in #117
- remove low-workload benchmarks that are flaky by @varun-sundar-rabindranath in #156
- nightly patches by @andy-neuma in #160
- Upstream sync v0.4.0.post1 (merged with `upstream-v0.4.0.post1`) by @mgoin in #157
- Bump version to 0.2 by @mgoin in #165
New Contributors
Full Changelog: 0.1.0...0.2.0
v0.1.0
Initial release of 🪄 nm-vllm 🪄
nm-vllm is Neural Magic's fork of vLLM with an opinionated focus on incorporating the latest LLM optimizations like quantization and sparsity for enhanced performance.
This release is based on vllm==0.3.2
Key Features
This first release focuses on our initial LLM performance contributions through support for Marlin, an extremely optimized FP16xINT4 matmul kernel, and weight sparsity acceleration.
Model Inference with Marlin (4-bit Quantization)
Marlin is enabled automatically if a quantized model has the `"is_marlin_format": true` flag present in its `quant_config.json`:
from vllm import LLM
model = LLM("neuralmagic/llama-2-7b-chat-marlin")
print(model.generate("Hello quantized world!"))
Optionally, you can specify it explicitly by setting `quantization="marlin"`.
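A minimal sketch of that explicit form, reusing the model from the example above (the prompt is illustrative):

```python
from vllm import LLM

# Minimal sketch: request the Marlin kernel explicitly instead of relying on
# the "is_marlin_format" flag in the model's quant_config.json.
model = LLM("neuralmagic/llama-2-7b-chat-marlin", quantization="marlin")
print(model.generate("Hello quantized world!"))
```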
Model Inference with Weight Sparsity
nm-vllm includes support for newly developed sparse inference kernels, which provide both memory reduction and inference acceleration for sparse models.
Here is an example running a 50% sparse OpenHermes 2.5 Mistral 7B model fine-tuned for instruction-following:
from vllm import LLM, SamplingParams
model = LLM(
"nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
sparsity="sparse_w16a16",
max_model_len=1024
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
There is also support for semi-structured 2:4 sparsity using the `sparsity="semi_structured_sparse_w16a16"` argument:
from vllm import LLM, SamplingParams
model = LLM("nm-testing/llama2.c-stories110M-pruned2.4", sparsity="semi_structured_sparse_w16a16")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
What's Changed
- Sparsity by @robertgshaw2-neuralmagic in #1
- Sparse fused gemm integration by @LucasWilkinson in #12
- Abf149/fix semi structured sparse by @afeldman-nm in #16
- Enable bfloat16 for sparse_w16a16 by @mgoin in #18
- seed workflow by @andy-neuma in #19
- Add bias support for sparse layers by @mgoin in #25
- Use naive decompress for SM<8.0 by @mgoin in #32
- Varun/benchmark workflow by @varun-sundar-rabindranath in #28
- initial GHA workflows for "build test" and "remote push" by @andy-neuma in #27
- Only import magic_wand if sparsity is enabled by @mgoin in #37
- Sparsity fix by @robertgshaw2-neuralmagic in #40
- Add NM benchmarking scripts & utils by @varun-sundar-rabindranath in #14
- Rs/marlin downstream v0.3.2 by @robertgshaw2-neuralmagic in #43
- Update README.md by @mgoin in #47
- additional updates to "bump-to-v0.3.2" by @andy-neuma in #39
- Add empty tensor initialization to LazyCompressedParameter by @alexm-nm in #53
- Update arg_utils.py with `semi_structured_sparse_w16a16` by @mgoin in #45
- additions for bump to v0.3.2 by @andy-neuma in #50
- formatting patch by @andy-neuma in #54
- Rs/bump main to v0.3.2 by @robertgshaw2-neuralmagic in #38
- Update setup.py naming by @mgoin in #44
- Loudly reject compression when the tensor isn't sparse enough by @mgoin in #55
- Benchmarking : Fix server response aggregation by @varun-sundar-rabindranath in #51
- initial whl workflow by @andy-neuma in #57
- GHA Benchmark : Automatic benchmarking on manual trigger by @varun-sundar-rabindranath in #46
- delete NOTICE.txt by @andy-neuma in #63
- pin GPU and use "--forked" for some tests by @andy-neuma in #58
- obfuscate pypi server ip by @andy-neuma in #64
- add HF cache by @andy-neuma in #65
- Rs/sparse integration test clean 2 by @robertgshaw2-neuralmagic in #67
- neuralmagic-vllm -> nm-vllm by @mgoin in #69
- Mark files that have been modified by Neural Magic by @tlrmchlsmth in #70
- Benchmarking - Add tensor_parallel_size arg for multi-gpu benchmarking by @varun-sundar-rabindranath in #66
- Jfinks license by @jeanniefinks in #72
- Add Nightly benchmark workflow by @varun-sundar-rabindranath in #62
- Rs/licensing by @robertgshaw2-neuralmagic in #68
- Rs/model integration tests logprobs by @robertgshaw2-neuralmagic in #71
- fixes issue identified by derek by @robertgshaw2-neuralmagic in #83
- Add `nm-vllm[sparse]` + `nm-vllm[sparsity]` extras, move version to `0.1` by @mgoin in #76
- Update setup.py by @mgoin in #82
- Fixes the multi-gpu tests by @robertgshaw2-neuralmagic in #79
- various updates to "build whl" workflow by @andy-neuma in #59
- Change magic_wand to nm-magic-wand by @mgoin in #86
New Contributors
- @LucasWilkinson made their first contribution in #12
- @alexm-nm made their first contribution in #53
- @tlrmchlsmth made their first contribution in #70
- @jeanniefinks made their first contribution in #72
Full Changelog: https://github.com/neuralmagic/nm-vllm/commits/0.1.0