Releases: neuralmagic/nm-vllm
v0.5.0
Key Features
This release is based on upstream vllm==v0.5.0.post
What's Changed
- bump up version to 0.5.0 by @dhuangnm in #278
- update publish.yml by @andy-neuma in #280
- fix a minor bug for docker build by @dhuangnm in #281
- update publish.yml by @andy-neuma in #282
- [CI/Build] Verify licenses by @derekk-nm in #272
- strip binaries by @dhuangnm in #283
- only run multi-gpu for python 3.10.12 by @andy-neuma in #284
- add more models, new num_logprobs by @derekk-nm in #285
- upload NIGHTLY assets to GCP by @andy-neuma in #286
- GCP test runners by @andy-neuma in #275
- Add nightly tag by @dhuangnm in #287
- Upstream sync 2024 06 08 by @robertgshaw2-neuralmagic in #288
- [Rel Eng] Update Nightly Workflow To Use Proper Skip List by @robertgshaw2-neuralmagic in #296
- [Rel Eng] Upstream sync 2024 06 11 by @robertgshaw2-neuralmagic in #298
- use nm-pypi service account by @andy-neuma in #300
- default nvcc_threads to 8 in order to reduce build execution time by @derekk-nm in #304
- Upstream sync 2024 06 12 by @robertgshaw2-neuralmagic in #302
- Fix docker image build issue by @dhuangnm in #305
- Remote push refactor by @robertgshaw2-neuralmagic in #297
- Update nm-nightly.yml by @derekk-nm in #308
- Use shared actions by @dbarbuzzi in #309
- enable tests that require C compiler by @andy-neuma in #310
- [ CI ] Fix Failing Test Server Logprobs (tolerance tweak) by @robertgshaw2-neuralmagic in #312
- [ CI ] Fix Failing Magic Wand Test by @robertgshaw2-neuralmagic in #311
- Add githash to nm-vllm by @dhuangnm in #299
- Upstream sync 2024 06 16 by @robertgshaw2-neuralmagic in #307
- [ CI ] skip local_workers_clean_shutdown by @robertgshaw2-neuralmagic in #317
- set PYTHON-3-10 job to gcp by @derekk-nm in #318
- [Rel Eng] Dial In LM Eval Tests Phase 1 by @robertgshaw2-neuralmagic in #289
- revert githash commit by @dhuangnm in #320
- Pruned Readme by @robertgshaw2-neuralmagic in #313
- Force-disable upstream tracking by @dbarbuzzi in #321
- [ README ] Update README.md by @robertgshaw2-neuralmagic in #323
Full Changelog: 0.4.0...0.5.0
v0.4.0
Key Features
This release is based on vllm==0.4.3
What's Changed
- turn off single gpu scenario by @andy-neuma in #88
- Benchmarking : Absolute -> Relative imports by @varun-sundar-rabindranath in #85
- Benchmarking : update Gi_per_thread by @varun-sundar-rabindranath in #90
- Update README.md with sparsity and quantization explainers by @mgoin in #91
- Add notebooks for sparsegpt and marlin compression with nm-vllm by @mgoin in #94
- upstream sync 2024-03-04 by @andy-neuma in #89
- Update README.md by @robertgshaw2-neuralmagic in #96
- Formatting : Fix yapf by @varun-sundar-rabindranath in #101
- Lower unstructured sparsity threshold to 40% by @mgoin in #100
- Benchmarking : Misc updates by @varun-sundar-rabindranath in #95
- upstream merge sync 2024-03-11 by @andy-neuma in #108
- Add lm-eval comparison script by @mgoin in #99
- Benchmarks : Standardize benchmark result store by @varun-sundar-rabindranath in #87
- seed whl centric workflows by @andy-neuma in #116
- Benchmarking : Remote push job by @varun-sundar-rabindranath in #92
- reverted accidental commit to main by @robertgshaw2-neuralmagic in #119
- skipped test for nightly failure by @robertgshaw2-neuralmagic in #120
- Turned back on the Marlin tests by @robertgshaw2-neuralmagic in #121
- Benchmarking : Prepare for GHA benchmark UI by @varun-sundar-rabindranath in #122
- Upstream sync 2024 03 14 by @robertgshaw2-neuralmagic in #127
- Benchmark : Update benchmark configs for Nightly by @varun-sundar-rabindranath in #126
- Benchmark : Modify/Add workflows/actions for github-action-benchmark by @varun-sundar-rabindranath in #123
- Benchmark: fix nightly by @varun-sundar-rabindranath in #131
- Fix nightly - 03/18/2024 by @varun-sundar-rabindranath in #136
- Upstream sync 2024 03 18 by @robertgshaw2-neuralmagic in #134
- Update Dockerfile with extensions support by @mgoin in #107
- Benchmark : Turn-off nightly multi-gpu benchmarks temporarily by @varun-sundar-rabindranath in #130
- Benchmark Fix : Remove special tokens from warmup prompts by @varun-sundar-rabindranath in #140
- Delete .github/pull_request_template.md by @mgoin in #145
- Benchmarking : Update readme by @varun-sundar-rabindranath in #144
- Initial Layerwise Profiler by @LucasWilkinson in #124
- Benchmark Fix : Fix JSON decode error by @varun-sundar-rabindranath in #142
- Upstream sync 2024 03 24 by @robertgshaw2-neuralmagic in #143
- Benchmark : Fix remote push job by @varun-sundar-rabindranath in #129
- Benchmarks : Prune nightly benchmarks by @varun-sundar-rabindranath in #150
- Lock lm-evaluation-harness to commit 262f879 by @mgoin in #151
- Benchmarks : Copy benchmark results to EFS by @varun-sundar-rabindranath in #148
- update readme with nvcc threads option by @varun-sundar-rabindranath in #153
- Generate tarball along with wheel build, and upload both in a package to GH by @dhuangnm in #138
- switch to nightly whl's by @andy-neuma in #154
- whl centric workflow for "remote push" by @andy-neuma in #117
- remove low-workload benchmarks that are flaky by @varun-sundar-rabindranath in #156
- nightly patches by @andy-neuma in #160
- Upstream sync v0.4.0.post1 (merged with `upstream-v0.4.0.post1`) by @mgoin in #157
- Bump version to 0.2 by @mgoin in #165
- rename wheels to manylinux and remove unused action by @dhuangnm in #167
- Update collect_env.py package list by @mgoin in #169
- Add lm-eval full accuracy sweep using GSM8k by @mgoin in #166
- Upstream sync 2024 04 08 by @SageMoore in #173
- Updated logo in README by @rgreenberg1 in #178
- Fix sparsity arg in Engine/ModelArgs by @mgoin in #179
- rm model_executor/layers/attention directory since it's been moved by @tlrmchlsmth in #181
- Upstream sync 2024 04 12 by @andy-neuma in #183
- mm publish workflow by @andy-neuma in #193
- GCP related build workflow updates by @andy-neuma in #196
- switch to GCP based build VM by @andy-neuma in #201
- cleanup venv by @andy-neuma in #217
- Upstream sync 2024 04 26 by @robertgshaw2-neuralmagic in #211
- update workflows to use generated whls by @andy-neuma in #204
- Fix nightly benchmark scripts by @dbarbuzzi in #229
- Add lm-eval correctness test by @dbarbuzzi in #210
- switch to k8s runners by @andy-neuma in #231
- Upstream sync 2024 05 05 by @robertgshaw2-neuralmagic in #224
- Marlin 2:4 Downstream (for v0.3 release) by @robertgshaw2-neuralmagic in #239
- Misc CI/CD updates by @dbarbuzzi in #240
- bump version to 0.3.0 by @dhuangnm in #241
- [Bugfix] Fix marlin 2:4 kernel crash on H100 by @mgoin in #243
- switch runner from aws to gcp for generate whl workflow by @dhuangnm in #242
- Add FP8 and marlin 2:4 tests for lm-eval by @mgoin in #244
- updates for nm-magic-wand, nightly or release by @andy-neuma in #247
- version check patch by @andy-neuma in #251
- increase timeouts by @andy-neuma in #253
- `requirements-dev.txt` and workflow patches by @andy-neuma in #255
- updates for automation (and release) by @andy-neuma in #265
- update install commands by @dhuangnm in #264
- Address py38/39 incompatibilities by @dbarbuzzi in #261
- [CI/Build] Basic server correctness test by @derekk-nm in #237
- bump up version and gate magic-wand version by @dhuangnm in #267
- remove release workflow concurrency limit by @andy-neuma in #270
- [CI/Build] include NOTICE in package dist-info by @derekk-nm in #271
- switch benchmarking and testing jobs to run using "test" label by @andy-neuma in #273
- Handle server startup failure in enter by @dbarbuzzi in #274
- Upstream sync 2024 05 19 by @robertgshaw2-neuralmagic in #249
- Docker image improvements by @dhuangnm in #276
- add latest tag for release docker image by @dhuangnm in #279
New Contributors
- @SageMoore made their first contribution in #173
- @rgreenberg1 made their first contribution in #178
- @derekk-nm made their first contribution in #237
Full Changelog: 0.1.0...v0.4.0
v0.3.0
Key Features
This release is based on vllm==0.4.2
What's Changed
- turn off single gpu scenario by @andy-neuma in #88
- Benchmarking : Absolute -> Relative imports by @varun-sundar-rabindranath in #85
- Benchmarking : update Gi_per_thread by @varun-sundar-rabindranath in #90
- Update README.md with sparsity and quantization explainers by @mgoin in #91
- Add notebooks for sparsegpt and marlin compression with nm-vllm by @mgoin in #94
- upstream sync 2024-03-04 by @andy-neuma in #89
- Update README.md by @robertgshaw2-neuralmagic in #96
- Formatting : Fix yapf by @varun-sundar-rabindranath in #101
- Lower unstructured sparsity threshold to 40% by @mgoin in #100
- Benchmarking : Misc updates by @varun-sundar-rabindranath in #95
- upstream merge sync 2024-03-11 by @andy-neuma in #108
- Add lm-eval comparison script by @mgoin in #99
- Benchmarks : Standardize benchmark result store by @varun-sundar-rabindranath in #87
- seed whl centric workflows by @andy-neuma in #116
- Benchmarking : Remote push job by @varun-sundar-rabindranath in #92
- reverted accidental commit to main by @robertgshaw2-neuralmagic in #119
- skipped test for nightly failure by @robertgshaw2-neuralmagic in #120
- Turned back on the Marlin tests by @robertgshaw2-neuralmagic in #121
- Benchmarking : Prepare for GHA benchmark UI by @varun-sundar-rabindranath in #122
- Upstream sync 2024 03 14 by @robertgshaw2-neuralmagic in #127
- Benchmark : Update benchmark configs for Nightly by @varun-sundar-rabindranath in #126
- Benchmark : Modify/Add workflows/actions for github-action-benchmark by @varun-sundar-rabindranath in #123
- Benchmark: fix nightly by @varun-sundar-rabindranath in #131
- Fix nightly - 03/18/2024 by @varun-sundar-rabindranath in #136
- Upstream sync 2024 03 18 by @robertgshaw2-neuralmagic in #134
- Update Dockerfile with extensions support by @mgoin in #107
- Benchmark : Turn-off nightly multi-gpu benchmarks temporarily by @varun-sundar-rabindranath in #130
- Benchmark Fix : Remove special tokens from warmup prompts by @varun-sundar-rabindranath in #140
- Delete .github/pull_request_template.md by @mgoin in #145
- Benchmarking : Update readme by @varun-sundar-rabindranath in #144
- Initial Layerwise Profiler by @LucasWilkinson in #124
- Benchmark Fix : Fix JSON decode error by @varun-sundar-rabindranath in #142
- Upstream sync 2024 03 24 by @robertgshaw2-neuralmagic in #143
- Benchmark : Fix remote push job by @varun-sundar-rabindranath in #129
- Benchmarks : Prune nightly benchmarks by @varun-sundar-rabindranath in #150
- Lock lm-evaluation-harness to commit 262f879 by @mgoin in #151
- Benchmarks : Copy benchmark results to EFS by @varun-sundar-rabindranath in #148
- update readme with nvcc threads option by @varun-sundar-rabindranath in #153
- Generate tarball along with wheel build, and upload both in a package to GH by @dhuangnm in #138
- switch to nightly whl's by @andy-neuma in #154
- whl centric workflow for "remote push" by @andy-neuma in #117
- remove low-workload benchmarks that are flaky by @varun-sundar-rabindranath in #156
- nightly patches by @andy-neuma in #160
- Upstream sync v0.4.0.post1 (merged with `upstream-v0.4.0.post1`) by @mgoin in #157
- Bump version to 0.2 by @mgoin in #165
- rename wheels to manylinux and remove unused action by @dhuangnm in #167
- Update collect_env.py package list by @mgoin in #169
- Add lm-eval full accuracy sweep using GSM8k by @mgoin in #166
- Upstream sync 2024 04 08 by @SageMoore in #173
- Updated logo in README by @rgreenberg1 in #178
- Fix sparsity arg in Engine/ModelArgs by @mgoin in #179
- rm model_executor/layers/attention directory since it's been moved by @tlrmchlsmth in #181
- Upstream sync 2024 04 12 by @andy-neuma in #183
- mm publish workflow by @andy-neuma in #193
- GCP related build workflow updates by @andy-neuma in #196
- switch to GCP based build VM by @andy-neuma in #201
- cleanup venv by @andy-neuma in #217
- Upstream sync 2024 04 26 by @robertgshaw2-neuralmagic in #211
- update workflows to use generated whls by @andy-neuma in #204
- Fix nightly benchmark scripts by @dbarbuzzi in #229
- Add lm-eval correctness test by @dbarbuzzi in #210
- switch to k8s runners by @andy-neuma in #231
- Upstream sync 2024 05 05 by @robertgshaw2-neuralmagic in #224
- Marlin 2:4 Downstream (for v0.3 release) by @robertgshaw2-neuralmagic in #239
- Misc CI/CD updates by @dbarbuzzi in #240
- bump version to 0.3.0 by @dhuangnm in #241
- [Cherrypick, Bugfix] Fix marlin 2:4 kernel crash on H100 by @mgoin in #245
- [cherry-pick] Update gen-whl.yml by @dhuangnm in #246
- updates for nm-magic-wand, nightly or release (#247) by @andy-neuma in #248
New Contributors
- @SageMoore made their first contribution in #173
- @rgreenberg1 made their first contribution in #178
- @dbarbuzzi made their first contribution in #229
Full Changelog: 0.1.0...0.3.0
v0.2.0
Key Features
This release is based on vllm==0.4.0.post1
- New model architectures supported! `DbrxForCausalLM`, `CohereForCausalLM` (Command-R), `JAISLMHeadModel`, `LlavaForConditionalGeneration` (experimental vision LM), `OrionForCausalLM`, `Qwen2MoeForCausalLM`, `StableLmForCausalLM`, `Starcoder2ForCausalLM`, and `XverseForCausalLM` (see the loading sketch after this list)
- Automated benchmarking
- Code coverage reporting
- lm-evaluation-harness nightly accuracy testing
- Layerwise Profiling for the inference graph (#124)
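As a quick sketch of what loading one of these newly supported architectures looks like (the model ID, prompt, and generation settings below are illustrative and not taken from these release notes):

```python
from vllm import LLM, SamplingParams

# Minimal sketch: a Starcoder2ForCausalLM checkpoint loads through the usual
# LLM entry point, the same as any previously supported architecture.
model = LLM("bigcode/starcoder2-3b", max_model_len=1024)
sampling_params = SamplingParams(max_tokens=64, temperature=0)
outputs = model.generate("def fibonacci(n):", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```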
What's Changed
- turn off single gpu scenario by @andy-neuma in #88
- Benchmarking : Absolute -> Relative imports by @varun-sundar-rabindranath in #85
- Benchmarking : update Gi_per_thread by @varun-sundar-rabindranath in #90
- Update README.md with sparsity and quantization explainers by @mgoin in #91
- Add notebooks for sparsegpt and marlin compression with nm-vllm by @mgoin in #94
- upstream sync 2024-03-04 by @andy-neuma in #89
- Update README.md by @robertgshaw2-neuralmagic in #96
- Formatting : Fix yapf by @varun-sundar-rabindranath in #101
- Lower unstructured sparsity threshold to 40% by @mgoin in #100
- Benchmarking : Misc updates by @varun-sundar-rabindranath in #95
- upstream merge sync 2024-03-11 by @andy-neuma in #108
- Add lm-eval comparison script by @mgoin in #99
- Benchmarks : Standardize benchmark result store by @varun-sundar-rabindranath in #87
- seed whl centric workflows by @andy-neuma in #116
- Benchmarking : Remote push job by @varun-sundar-rabindranath in #92
- reverted accidental commit to main by @robertgshaw2-neuralmagic in #119
- skipped test for nightly failure by @robertgshaw2-neuralmagic in #120
- Turned back on the Marlin tests by @robertgshaw2-neuralmagic in #121
- Benchmarking : Prepare for GHA benchmark UI by @varun-sundar-rabindranath in #122
- Upstream sync 2024 03 14 by @robertgshaw2-neuralmagic in #127
- Benchmark : Update benchmark configs for Nightly by @varun-sundar-rabindranath in #126
- Benchmark : Modify/Add workflows/actions for github-action-benchmark by @varun-sundar-rabindranath in #123
- Benchmark: fix nightly by @varun-sundar-rabindranath in #131
- Fix nightly - 03/18/2024 by @varun-sundar-rabindranath in #136
- Upstream sync 2024 03 18 by @robertgshaw2-neuralmagic in #134
- Update Dockerfile with extensions support by @mgoin in #107
- Benchmark : Turn-off nightly multi-gpu benchmarks temporarily by @varun-sundar-rabindranath in #130
- Benchmark Fix : Remove special tokens from warmup prompts by @varun-sundar-rabindranath in #140
- Delete .github/pull_request_template.md by @mgoin in #145
- Benchmarking : Update readme by @varun-sundar-rabindranath in #144
- Initial Layerwise Profiler by @LucasWilkinson in #124
- Benchmark Fix : Fix JSON decode error by @varun-sundar-rabindranath in #142
- Upstream sync 2024 03 24 by @robertgshaw2-neuralmagic in #143
- Benchmark : Fix remote push job by @varun-sundar-rabindranath in #129
- Benchmarks : Prune nightly benchmarks by @varun-sundar-rabindranath in #150
- Lock lm-evaluation-harness to commit 262f879 by @mgoin in #151
- Benchmarks : Copy benchmark results to EFS by @varun-sundar-rabindranath in #148
- update readme with nvcc threads option by @varun-sundar-rabindranath in #153
- Generate tarball along with wheel build, and upload both in a package to GH by @dhuangnm in #138
- switch to nightly whl's by @andy-neuma in #154
- whl centric workflow for "remote push" by @andy-neuma in #117
- remove low-workload benchmarks that are flaky by @varun-sundar-rabindranath in #156
- nightly patches by @andy-neuma in #160
- Upstream sync v0.4.0.post1 (merged with `upstream-v0.4.0.post1`) by @mgoin in #157
- Bump version to 0.2 by @mgoin in #165
New Contributors
Full Changelog: 0.1.0...0.2.0
v0.1.0
Initial release of 🪄 nm-vllm 🪄
nm-vllm is Neural Magic's fork of vLLM with an opinionated focus on incorporating the latest LLM optimizations like quantization and sparsity for enhanced performance.
This release is based on vllm==0.3.2
Key Features
This first release focuses on our initial LLM performance contributions through support for Marlin, an extremely optimized FP16xINT4 matmul kernel, and weight sparsity acceleration.
Model Inference with Marlin (4-bit Quantization)
Marlin is enabled automatically if a quantized model has the `"is_marlin_format": true` flag present in its `quant_config.json`:
from vllm import LLM
model = LLM("neuralmagic/llama-2-7b-chat-marlin")
print(model.generate("Hello quantized world!"))
Optionally, you can specify it explicitly by setting `quantization="marlin"`.
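A minimal sketch of that explicit form, reusing the model from the example above (the prompt is illustrative):

```python
from vllm import LLM

# Minimal sketch: request the Marlin kernel explicitly instead of relying on
# the "is_marlin_format" flag in the model's quant_config.json.
model = LLM("neuralmagic/llama-2-7b-chat-marlin", quantization="marlin")
print(model.generate("Hello quantized world!"))
```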
Model Inference with Weight Sparsity
nm-vllm includes support for newly developed sparse inference kernels, which provide both memory reduction and inference acceleration for sparse models.
Here is an example running a 50% sparse OpenHermes 2.5 Mistral 7B model fine-tuned for instruction-following:
from vllm import LLM, SamplingParams
model = LLM(
"nm-testing/OpenHermes-2.5-Mistral-7B-pruned50",
sparsity="sparse_w16a16",
max_model_len=1024
)
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
There is also support for semi-structured 2:4 sparsity using the `sparsity="semi_structured_sparse_w16a16"` argument:
from vllm import LLM, SamplingParams
model = LLM("nm-testing/llama2.c-stories110M-pruned2.4", sparsity="semi_structured_sparse_w16a16")
sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Once upon a time, ", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
What's Changed
- Sparsity by @robertgshaw2-neuralmagic in #1
- Sparse fused gemm integration by @LucasWilkinson in #12
- Abf149/fix semi structured sparse by @afeldman-nm in #16
- Enable bfloat16 for sparse_w16a16 by @mgoin in #18
- seed workflow by @andy-neuma in #19
- Add bias support for sparse layers by @mgoin in #25
- Use naive decompress for SM<8.0 by @mgoin in #32
- Varun/benchmark workflow by @varun-sundar-rabindranath in #28
- initial GHA workflows for "build test" and "remote push" by @andy-neuma in #27
- Only import magic_wand if sparsity is enabled by @mgoin in #37
- Sparsity fix by @robertgshaw2-neuralmagic in #40
- Add NM benchmarking scripts & utils by @varun-sundar-rabindranath in #14
- Rs/marlin downstream v0.3.2 by @robertgshaw2-neuralmagic in #43
- Update README.md by @mgoin in #47
- additional updates to "bump-to-v0.3.2" by @andy-neuma in #39
- Add empty tensor initialization to LazyCompressedParameter by @alexm-nm in #53
- Update arg_utils.py with `semi_structured_sparse_w16a16` by @mgoin in #45
- additions for bump to v0.3.2 by @andy-neuma in #50
- formatting patch by @andy-neuma in #54
- Rs/bump main to v0.3.2 by @robertgshaw2-neuralmagic in #38
- Update setup.py naming by @mgoin in #44
- Loudly reject compression when the tensor isn't sparse enough by @mgoin in #55
- Benchmarking : Fix server response aggregation by @varun-sundar-rabindranath in #51
- initial whl workflow by @andy-neuma in #57
- GHA Benchmark : Automatic benchmarking on manual trigger by @varun-sundar-rabindranath in #46
- delete NOTICE.txt by @andy-neuma in #63
- pin GPU and use "--forked" for some tests by @andy-neuma in #58
- obfuscate pypi server ip by @andy-neuma in #64
- add HF cache by @andy-neuma in #65
- Rs/sparse integration test clean 2 by @robertgshaw2-neuralmagic in #67
- neuralmagic-vllm -> nm-vllm by @mgoin in #69
- Mark files that have been modified by Neural Magic by @tlrmchlsmth in #70
- Benchmarking - Add tensor_parallel_size arg for multi-gpu benchmarking by @varun-sundar-rabindranath in #66
- Jfinks license by @jeanniefinks in #72
- Add Nightly benchmark workflow by @varun-sundar-rabindranath in #62
- Rs/licensing by @robertgshaw2-neuralmagic in #68
- Rs/model integration tests logprobs by @robertgshaw2-neuralmagic in #71
- fixes issue identified by derek by @robertgshaw2-neuralmagic in #83
- Add `nm-vllm[sparse]` + `nm-vllm[sparsity]` extras, move version to `0.1` by @mgoin in #76
- Update setup.py by @mgoin in #82
- Fixes the multi-gpu tests by @robertgshaw2-neuralmagic in #79
- various updates to "build whl" workflow by @andy-neuma in #59
- Change magic_wand to nm-magic-wand by @mgoin in #86
New Contributors
- @LucasWilkinson made their first contribution in #12
- @alexm-nm made their first contribution in #53
- @tlrmchlsmth made their first contribution in #70
- @jeanniefinks made their first contribution in #72
Full Changelog: https://github.com/neuralmagic/nm-vllm/commits/0.1.0