Upstream sync 2024 04 26 #211

robertgshaw2-neuralmagic · 2024-04-26T21:41:22Z

SUMMARY:
Merge commits from vllm-project@a37d815 to vllm-project@b6dcb4d

Note that vllm-project@a37d815 is NOT included in this merge.

[Core][Distributed] use existing torch.cuda.device context manager (vllm-project#4318)

This PR fixes the Marlin kernel crash on H100 that was reported in neuralmagic#187. The crash was caused by the inline PTX assembly that implemented async_copy with streaming behavior. The fix is to use the more standard PTX for async_copy, without the fractional L2 policy for "evict_first". There is no performance difference between the standard async_copy PTX and the previous version.
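For readers unfamiliar with the instruction, here is a minimal sketch of what a "standard" async copy looks like in inline PTX: a plain `cp.async.cg.shared.global` with no `createpolicy` register and no `L2::cache_hint` operand. The helper and kernel names (`cp_async16`, `cp_async_commit_and_wait`, `copy_kernel`) are made up for illustration and are not taken from the Marlin source; the actual kernel code in the PR may differ. It assumes an sm_80 or newer GPU and compilation with something like `nvcc -arch=sm_80`.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not the Marlin function): 16-byte asynchronous
// global -> shared copy using plain cp.async, without any L2 cache policy.
__device__ inline void cp_async16(void* smem_ptr, const void* glob_ptr) {
  // cp.async takes a 32-bit shared-state-space address.
  uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(smem), "l"(glob_ptr));
}

// Commit the outstanding async copies as one group and wait for all groups.
__device__ inline void cp_async_commit_and_wait() {
  asm volatile("cp.async.commit_group;\n" ::);
  asm volatile("cp.async.wait_group 0;\n" ::);
}

// Illustrative kernel: each of 32 threads stages one int4 through shared memory.
__global__ void copy_kernel(const int4* in, int4* out) {
  __shared__ int4 tile[32];
  cp_async16(&tile[threadIdx.x], &in[threadIdx.x]);
  cp_async_commit_and_wait();
  __syncthreads();
  out[threadIdx.x] = tile[threadIdx.x];
}

int main() {
  const int n = 32;
  int4 host_in[n], host_out[n];
  for (int i = 0; i < n; ++i) host_in[i] = make_int4(i, i + 1, i + 2, i + 3);

  int4 *dev_in = nullptr, *dev_out = nullptr;
  cudaMalloc(&dev_in, n * sizeof(int4));
  cudaMalloc(&dev_out, n * sizeof(int4));
  cudaMemcpy(dev_in, host_in, n * sizeof(int4), cudaMemcpyHostToDevice);

  copy_kernel<<<1, n>>>(dev_in, dev_out);
  cudaError_t err = cudaDeviceSynchronize();
  cudaMemcpy(host_out, dev_out, n * sizeof(int4), cudaMemcpyDeviceToHost);

  // Compare the round-tripped data against the input.
  bool same = (err == cudaSuccess);
  for (int i = 0; i < n && same; ++i) {
    same = host_in[i].x == host_out[i].x && host_in[i].y == host_out[i].y &&
           host_in[i].z == host_out[i].z && host_in[i].w == host_out[i].w;
  }
  printf("%s\n", same ? "copy matched" : "copy mismatch or CUDA error");

  cudaFree(dev_in);
  cudaFree(dev_out);
  return same ? 0 : 1;
}
```

The streaming variant described above would additionally create an "evict_first" L2 policy and pass it to the copy as a cache hint; dropping that operand is the whole change in this sketch.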

andy-neuma

cool.

.github/scripts/run-tests

rkooo567 and others added 30 commits April 26, 2024 21:04

[Test] Test multiple attn backend for chunked prefill. (vllm-project#4023)

eb2428e

[Bugfix] fix type hint for py 3.8 (vllm-project#4036)

71760ce

[Misc] Fix typo in scheduler.py (vllm-project#4022)

405a695

[mypy] Add mypy type annotation part 1 (vllm-project#4006)

801ad22

[Core] fix custom allreduce default value (vllm-project#4040)

58911ec

Fix triton compilation issue (vllm-project#3984)

094013d

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

[Bugfix] Fix LoRA bug (vllm-project#4032)

0b5c9ea

[CI/Test] expand ruff and yapf for all supported python version (vllm-project#4037)

b35bba7

[Bugfix] More type hint fixes for py 3.8 (vllm-project#4039)

0356684

[Core][Distributed] improve logging for init dist (vllm-project#4042)

0f5a490

[Bugfix] fix_log_time_in_metrics (vllm-project#4050)

a738567

[Bugfix] fix_small_bug_in_neuron_executor (vllm-project#4051)

5444860

[Kernel] Add punica dimension for Baichuan-13B (vllm-project#4053)

7dd0af0

[Frontend] [Core] feat: Add model loading using tensorizer (vllm-project#3476)

fab8ca1

[Core] avoid too many cuda context by caching p2p test (vllm-project#4021)

f39e0b5

[BugFix] Fix tensorizer extra in setup.py (vllm-project#4072)

de26ef7

[Docs] document that mixtral 8x22b is supported (vllm-project#4073)

d3f28b1

[Misc] Upgrade triton to 2.2.0 (vllm-project#4061)

0012b9b

[Bugfix] Fix filelock version requirement (vllm-project#4075)

6bd8ad1

[Misc][Minor] Fix CPU block num log in CPUExecutor. (vllm-project#4088)

5c33590

[Core] Simplifications to executor classes (vllm-project#4071)

3d28207

[Doc] Add better clarity for tensorizer usage (vllm-project#4090)

0008bf9

Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

[Bugfix] Fix ray workers profiling with nsight (vllm-project#4095)

6800f95

[Typing] Fix Sequence type GenericAlias only available after Python 3.9. (vllm-project#4092)

43af0d0

[Core] Fix engine-use-ray broken (vllm-project#4105)

f045612

LM Format Enforcer Guided Decoding Support (vllm-project#3868)

bc92515

Co-authored-by: Simon Mo <simon.mo@hey.com>

[Core] Refactor model loading code (vllm-project#4097)

2986e80

[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (vllm-project#3894)

945a6b7

[Misc] [CI] Fix CI failure caught after merge (vllm-project#4126)

a84676e

[CI] Move CPU/AMD tests to after wait (vllm-project#4123)

f56e1ae

youkaichao and others added 24 commits April 26, 2024 21:09

[CI][Build] change pynvml to nvidia-ml-py (vllm-project#4302)

ffc3593

[Misc] Reduce supported Punica dtypes (vllm-project#4304)

cc2c2f2

[Core][Distributed] use existing torch.cuda.device (vllm-project#4318)

afd3970

[Core][Distributed] use existing torch.cuda.device context manager (vllm-project#4318)

[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (vllm-project#4279)

27ced33

[Doc] Add note for docker user (vllm-project#4340)

50f4e48

Co-authored-by: Simon Mo <simon.mo@hey.com>

[Misc] Use public API in benchmark_throughput (vllm-project#4300)

382fb33

[Model] Adds Phi-3 support (vllm-project#4298)

e207f23

[Core] Move ray_utils.py from engine to executor package (vllm-project#4347)

b290035

[Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 (vllm-project#4324)

bd92e76

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

[CI/Build] Adding functionality to reset the node's GPUs before processing. (vllm-project#4213)

74e20c2

[Doc] README Phi-3 name fix. (vllm-project#4372)

0f38d71

Co-authored-by: Caio Mendes <caiocesart@microsoft.com>

[Core]refactor aqlm quant ops (vllm-project#4351)

fff6cd2

[Mypy] Typing lora folder (vllm-project#4337)

9bb7eff

[Misc] Fix flash attention backend log (vllm-project#4368)

1917d86

./format, fixed tests failing in automation due to ray.init()

b6d61b2

fixed typo in run tests script

6dcf181

fixed sparsity issues with model loader refactor

a8b853a

format

b7fb44b

linter

8177a4b

ruff ruff

96219f1

updated tests to skip starcoder for now

16f1aa2

yapf

475ec0a

Merge branch 'main' into upstream-sync-2024-04-26

e5da6ba

SageMoore approved these changes Apr 30, 2024

andy-neuma approved these changes Apr 30, 2024

mgoin reviewed Apr 30, 2024

.github/scripts/run-tests (resolved)

mgoin approved these changes Apr 30, 2024

andy-neuma merged commit 8f55a0c into main May 2, 2024
12 checks passed

andy-neuma deleted the upstream-sync-2024-04-26 branch May 2, 2024 16:40
