Upgrade the latest vLLM version 09/18 #4

Merged: 247 commits, Sep 19, 2024
Commits
2deb029
[Performance][BlockManagerV2] Mark prefix cache block as computed aft…
comaniac Aug 26, 2024
6653040
[Misc] Update `qqq` to use vLLMParameters (#7805)
dsikka Aug 26, 2024
dd9857f
[Misc] Update `gptq_marlin_24` to use vLLMParameters (#7762)
dsikka Aug 26, 2024
05826c8
[misc] fix custom allreduce p2p cache file generation (#7853)
youkaichao Aug 26, 2024
760e9f7
[Bugfix] neuron: enable tensor parallelism (#7562)
omrishiv Aug 26, 2024
015e6cc
[Misc] Update compressed tensors lifecycle to remove `prefix` from `c…
dsikka Aug 27, 2024
2eedede
[Core] Asynchronous Output Processor (#7049)
megha95 Aug 27, 2024
39178c7
[Tests] Disable retries and use context manager for openai client (#7…
njhill Aug 27, 2024
64cc644
[core][torch.compile] discard the compile for profiling (#7796)
youkaichao Aug 27, 2024
9606c71
Revert #7509 (#7887)
comaniac Aug 27, 2024
6fc4e6e
[Model] Add Mistral Tokenization to improve robustness and chat encod…
patrickvonplaten Aug 27, 2024
9db6421
[CI/Build][VLM] Cleanup multiple images inputs model test (#7897)
Isotr0py Aug 27, 2024
076169f
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810)
jikunshang Aug 27, 2024
42e932c
[CI/Build][ROCm] Enabling tensorizer tests for ROCm (#7237)
alexeykondrat Aug 27, 2024
b09c755
[Bugfix] Fix phi3v incorrect image_idx when using async engine (#7916)
Isotr0py Aug 27, 2024
ed6f002
[cuda][misc] error on empty CUDA_VISIBLE_DEVICES (#7924)
youkaichao Aug 27, 2024
fc91188
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
dsikka Aug 27, 2024
345be0e
[benchmark] Update TGI version (#7917)
philschmid Aug 27, 2024
5340a2d
[Model] Add multi-image input support for LLaVA-Next offline inferenc…
zifeitong Aug 27, 2024
9c71c97
[mypy] Enable mypy type checking for `vllm/core` (#7229)
jberkhahn Aug 27, 2024
fab5f53
[Core][VLM] Stack multimodal tensors to represent multiple images wit…
petersalas Aug 28, 2024
bc6e42a
[hardware][rocm] allow rocm to override default env var (#7926)
youkaichao Aug 28, 2024
c166e7e
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add che…
bnellnm Aug 28, 2024
51f86bf
[mypy][CI/Build] Fix mypy errors (#7929)
DarkLight1337 Aug 28, 2024
f508e03
[Core] Async_output_proc: Add virtual engine support (towards pipelin…
alexm-neuralmagic Aug 28, 2024
e358053
[Performance] Enable chunked prefill and prefix caching together (#7753)
comaniac Aug 28, 2024
f52a43a
[ci][test] fix pp test failure (#7945)
youkaichao Aug 28, 2024
98c12cf
[Doc] fix the autoAWQ example (#7937)
stas00 Aug 28, 2024
ef9baee
[Bugfix][VLM] Fix incompatibility between #7902 and #7230 (#7948)
DarkLight1337 Aug 28, 2024
b98cc28
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when availabl…
pavanimajety Aug 28, 2024
e5697d1
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize …
rasmith Aug 28, 2024
eeffde1
[TPU] Upgrade PyTorch XLA nightly (#7967)
WoosukKwon Aug 28, 2024
8c56e57
[Doc] fix 404 link (#7966)
stas00 Aug 28, 2024
fdd9daa
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#…
mzusman Aug 28, 2024
3cdfe1f
[Bugfix] Make torch registration of punica ops optional (#7970)
bnellnm Aug 28, 2024
ce6bf3a
[torch.compile] avoid Dynamo guard evaluation overhead (#7898)
youkaichao Aug 28, 2024
af59df0
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961)
mgoin Aug 28, 2024
4289cad
[Frontend] Minor optimizations to zmq decoupled front-end (#7957)
njhill Aug 29, 2024
a7f65c2
[torch.compile] remove reset (#7975)
youkaichao Aug 29, 2024
74d5543
[VLM][Core] Fix exceptions on ragged NestedTensors (#7974)
petersalas Aug 29, 2024
ef99a78
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when …
youkaichao Aug 29, 2024
f205c09
[Bugfix] Unify rank computation across regular decoding and speculati…
jmkuebler Aug 29, 2024
3f60f22
[Core] Combine async postprocessor and multi-step (#7921)
alexm-neuralmagic Aug 29, 2024
6b34215
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFi…
pavanimajety Aug 29, 2024
c334b18
extend cuda graph size for H200 (#7894)
kushanam Aug 29, 2024
d78789a
[Bugfix] Fix incorrect vocal embedding shards for GGUF model in tenso…
Isotr0py Aug 29, 2024
86a677d
[misc] update tpu int8 to use new vLLM Parameters (#7973)
dsikka Aug 29, 2024
257afc3
[Neuron] Adding support for context-lenght, token-gen buckets. (#7885)
hbikki Aug 29, 2024
4664cea
support bitsandbytes 8-bit and FP4 quantized models (#7445)
chenqianfzh Aug 29, 2024
0c785d3
Add more percentiles and latencies (#7759)
wschin Aug 29, 2024
4abed65
[VLM] Disallow overflowing `max_model_len` for multimodal models (#7998)
DarkLight1337 Aug 30, 2024
428dd14
[Core] Logprobs support in Multi-step (#7652)
afeldman-nm Aug 30, 2024
80c7b08
[TPU] Async output processing for TPU (#8011)
WoosukKwon Aug 30, 2024
34a0e96
[Kernel] changing fused moe kernel chunk size default to 32k (#7995)
avshalomman Aug 30, 2024
dc13e99
[MODEL] add Exaone model support (#7819)
nayohan Aug 30, 2024
2148441
[TPU] Support single and multi-host TPUs on GKE (#7613)
richardsliu Aug 30, 2024
afd39a4
[Bugfix] Fix import error in Exaone model (#8034)
DarkLight1337 Aug 30, 2024
f97be32
[VLM][Model] TP support for ViTs (#7186)
ChristopherCho Aug 30, 2024
98cef6a
[Core] Increase default `max_num_batched_tokens` for multimodal model…
DarkLight1337 Aug 30, 2024
058344f
[Frontend]-config-cli-args (#7737)
KaunilD Aug 30, 2024
2684efc
[TPU][Bugfix] Fix tpu type api (#8035)
WoosukKwon Aug 30, 2024
1248e85
[Model] Adding support for MSFT Phi-3.5-MoE (#7729)
wenxcs Aug 30, 2024
622f8ab
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013)
pavanimajety Aug 31, 2024
d05f0a9
[Bugfix] Fix import error in Phi-3.5-MoE (#8052)
DarkLight1337 Aug 31, 2024
4f5d844
[Bugfix] Fix ModelScope models in v0.5.5 (#8037)
NickLucche Aug 31, 2024
8423aef
[BugFix][Core] Multistep Fix Crash on Request Cancellation (#8059)
robertgshaw2-neuralmagic Aug 31, 2024
5231f08
[Frontend][VLM] Add support for multiple multi-modal items (#8049)
ywang96 Aug 31, 2024
5b86b19
[Misc] Optional installation of audio related packages (#8063)
ywang96 Sep 1, 2024
f8d6014
[Model] Add Granite model (#7436)
shawntan Sep 2, 2024
e6a26ed
[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244)
LiuXiaoxuanPKU Sep 2, 2024
e2b2aa5
[TPU] Align worker index with node boundary (#7932)
WoosukKwon Sep 2, 2024
4ca65a9
[Core][Bugfix] Accept GGUF model without .gguf extension (#8056)
Isotr0py Sep 2, 2024
dd2a6a8
[Bugfix] Fix internlm2 tensor parallel inference (#8055)
Isotr0py Sep 2, 2024
6e36f4f
improve chunked prefill performance
noooop Sep 2, 2024
0fbc669
[Bugfix] Fix single output condition in output processor (#7881)
WoosukKwon Sep 3, 2024
ec26653
[Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backe…
Isotr0py Sep 3, 2024
bd852f2
[Performance] Enable chunked prefill and prefix caching together (#8120)
comaniac Sep 3, 2024
95a178f
[CI] Only PR reviewers/committers can trigger CI on PR (#8124)
khluu Sep 3, 2024
6d646d0
[Core] Optimize Async + Multi-step (#8050)
alexm-neuralmagic Sep 3, 2024
652c83b
[Misc] Raise a more informative exception in add/remove_logger (#7750)
Yard1 Sep 3, 2024
c02638e
[CI/Build] make pip install vllm work in macos (for import only) (#8118)
tomeras91 Sep 3, 2024
f1575dc
[ci] Fix GHA workflow (#8129)
khluu Sep 3, 2024
0af3abe
[TPU][Bugfix] Fix next_token_ids shape (#8128)
WoosukKwon Sep 3, 2024
dc0b606
[CI] Change PR remainder to avoid at-mentions (#8134)
simon-mo Sep 3, 2024
2188a60
[Misc] Update `GPTQ` to use `vLLMParameters` (#7976)
dsikka Sep 3, 2024
d4db9f5
[Benchmark] Add `--async-engine` option to benchmark_throughput.py (#…
njhill Sep 4, 2024
61f4a93
[TPU][Bugfix] Use XLA rank for persistent cache path (#8137)
WoosukKwon Sep 4, 2024
e16fa99
[Misc] Update fbgemmfp8 to use `vLLMParameters` (#7972)
dsikka Sep 4, 2024
2be8ec6
[Model] Add Ultravox support for multiple audio chunks (#7963)
petersalas Sep 4, 2024
855c262
[Frontend] Multimodal support in offline chat (#8098)
DarkLight1337 Sep 4, 2024
ccd7207
chore: Update check-wheel-size.py to read MAX_SIZE_MB from env (#8103)
haitwang-cloud Sep 4, 2024
d331156
[Bugfix] remove post_layernorm in siglip (#8106)
wnma3mz Sep 4, 2024
2ad2e56
[MISC] Consolidate FP8 kv-cache tests (#8131)
comaniac Sep 4, 2024
d1dec64
[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
alexeykondrat Sep 4, 2024
561d6f8
[CI] Change test input in Gemma LoRA test (#8163)
WoosukKwon Sep 4, 2024
e02ce49
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistra…
K-Mistele Sep 4, 2024
77d9e51
[MISC] Replace input token throughput with total token throughput (#8…
comaniac Sep 4, 2024
008cf88
[Neuron] Adding support for adding/ overriding neuron configuration a…
hbikki Sep 4, 2024
32e7db2
Bump version to v0.6.0 (#8166)
simon-mo Sep 4, 2024
e01c2be
[Doc] [Misc] Create CODE_OF_CONDUCT.md (#8161)
mmcelaney Sep 4, 2024
1afc931
[bugfix] >1.43 constraint for openai (#8169)
SolitaryThinker Sep 5, 2024
4624d98
[Misc] Clean up RoPE forward_native (#8076)
WoosukKwon Sep 5, 2024
ba262c4
[ci] Mark LoRA test as soft-fail (#8160)
khluu Sep 5, 2024
e39ebf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8…
elfiegg Sep 5, 2024
288a938
[Doc] Indicate more information about supported modalities (#8181)
DarkLight1337 Sep 5, 2024
8685ba1
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parall…
Manikandan-Thangaraj-ZS0321 Sep 5, 2024
9da25a8
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
alex-jw-brooks Sep 5, 2024
2ee4528
Move verify_marlin_supported to GPTQMarlinLinearMethod (#8165)
mgoin Sep 5, 2024
2febcf2
[Documentation][Spec Decode] Add documentation about lossless guarant…
sroy745 Sep 5, 2024
db3bf7c
[Core] Support load and unload LoRA in api server (#6566)
Jeffwan Sep 6, 2024
baa5467
[BugFix] Fix Granite model configuration (#8216)
njhill Sep 6, 2024
e5cab71
[Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191)
afeldman-nm Sep 6, 2024
de80783
[Misc] Use ray[adag] dependency instead of cuda (#7938)
ruisearch42 Sep 6, 2024
1447c97
[CI/Build] Increasing timeout for multiproc worker tests (#8203)
alexeykondrat Sep 6, 2024
9db52ea
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize…
rasmith Sep 6, 2024
23f3222
[Misc] Remove `SqueezeLLM` (#8220)
dsikka Sep 6, 2024
29f49cd
[Model] Allow loading from original Mistral format (#8168)
patrickvonplaten Sep 6, 2024
12dd715
[misc] [doc] [frontend] LLM torch profiler support (#7943)
SolitaryThinker Sep 7, 2024
41e95c5
[Bugfix] Fix Hermes tool call chat template bug (#8256)
K-Mistele Sep 7, 2024
2f707fc
[Model] Multi-input support for LLaVA (#8238)
DarkLight1337 Sep 7, 2024
795b662
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_ser…
wschin Sep 7, 2024
ce2702a
[tpu][misc] fix typo (#8260)
youkaichao Sep 7, 2024
9f68e00
[Bugfix] Fix broken OpenAI tensorizer test (#8258)
DarkLight1337 Sep 7, 2024
e807125
[Model][VLM] Support multi-images inputs for InternVL2 models (#8201)
Isotr0py Sep 7, 2024
36bf815
[Model][VLM] Decouple weight loading logic for `Paligemma` (#8269)
Isotr0py Sep 7, 2024
b962ee1
ppc64le: Dockerfile fixed, and a script for buildkite (#8026)
sumitd2 Sep 7, 2024
cfe712b
[CI/Build] Use python 3.12 in cuda image (#8133)
joerunde Sep 7, 2024
4ef41b8
[Bugfix] Fix async postprocessor in case of preemption (#8267)
alexm-neuralmagic Sep 8, 2024
08287ef
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format…
K-Mistele Sep 9, 2024
58fcc85
[Frontend] Add progress reporting to run_batch.py (#8060)
alugowski Sep 9, 2024
f9b4a2d
[Bugfix] Correct adapter usage for cohere and jamba (#8292)
vladislavkruglikov Sep 9, 2024
c7cb5c3
[Misc] GPTQ Activation Ordering (#8135)
kylesayrs Sep 9, 2024
6cd5e5b
[Misc] Fused MoE Marlin support for GPTQ (#8217)
dsikka Sep 10, 2024
a1d8742
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (…
simon-mo Sep 10, 2024
da1a844
[Bugfix] Fix missing `post_layernorm` in CLIP (#8155)
DarkLight1337 Sep 10, 2024
6234385
[CI/Build] enable ccache/scccache for HIP builds (#8327)
dtrifiro Sep 10, 2024
8c054b7
[Frontend] Clean up type annotations for mistral tokenizer (#8314)
DarkLight1337 Sep 10, 2024
f421f3c
[CI/Build] Enabling kernels tests for AMD, ignoring some of then that…
alexeykondrat Sep 10, 2024
02751a7
Fix ppc64le buildkite job (#8309)
sumitd2 Sep 10, 2024
5faedf1
[Spec Decode] Move ops.advance_step to flash attn advance_step (#8224)
kevin314 Sep 10, 2024
04e7c4e
[Misc] remove peft as dependency for prompt models (#8162)
prashantgupta24 Sep 10, 2024
b1f3e18
[MISC] Keep chunked prefill enabled by default with long context when…
comaniac Sep 10, 2024
22f3a4b
[Bugfix] lookahead block table with cuda graph max capture (#8340)
alexm-neuralmagic Sep 10, 2024
1d5e397
[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (#8172)
SolitaryThinker Sep 10, 2024
94144e7
[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (#8043)
tlrmchlsmth Sep 10, 2024
e497b8a
[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (#8329)
jeejeelee Sep 11, 2024
1230263
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parall…
Isotr0py Sep 11, 2024
efcf946
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. (…
pavanimajety Sep 11, 2024
6a512a0
[model] Support for Llava-Next-Video model (#7559)
TKONIY Sep 11, 2024
cea95df
[Frontend] Create ErrorResponse instead of raising exceptions in run_…
pooyadavoodi Sep 11, 2024
3b7fea7
[Model][VLM] Add Qwen2-VL model support (#7905)
fyabc Sep 11, 2024
0b952af
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257)
bigPYJ1151 Sep 11, 2024
aea02f3
[CI/Build] Excluding test_moe.py from AMD Kernels tests for investiga…
alexeykondrat Sep 11, 2024
7015417
[Bugfix] Add missing attributes in mistral tokenizer (#8364)
DarkLight1337 Sep 11, 2024
73202db
[Kernel][Misc] register ops to prevent graph breaks (#6917)
bnellnm Sep 11, 2024
8baa454
[Misc] Move device options to a single place (#8322)
akx Sep 11, 2024
775f00f
[Speculative Decoding] Test refactor (#8317)
LiuXiaoxuanPKU Sep 11, 2024
d394787
Pixtral (#8377)
patrickvonplaten Sep 11, 2024
3fd2b0d
Bump version to v0.6.1 (#8379)
simon-mo Sep 11, 2024
a65cb16
[MISC] Dump model runner inputs when crashing (#8305)
comaniac Sep 12, 2024
f842a7a
[misc] remove engine_use_ray (#8126)
youkaichao Sep 12, 2024
b71c956
[TPU] Use Ray for default distributed backend (#8389)
WoosukKwon Sep 12, 2024
b6c75e1
Fix the AMD weight loading tests (#8390)
mgoin Sep 12, 2024
5a60699
[Bugfix]: Fix the logic for deciding if tool parsing is used (#8366)
tomeras91 Sep 12, 2024
1bf2dd9
[Gemma2] add bitsandbytes support for Gemma2 (#8338)
blueyo0 Sep 12, 2024
295c473
[Misc] Raise error when using encoder/decoder model with cpu backend …
kevin314 Sep 12, 2024
42ffba1
[Misc] Use RoPE cache for MRoPE (#8396)
WoosukKwon Sep 12, 2024
7de49aa
[torch.compile] hide slicing under custom op for inductor (#8384)
youkaichao Sep 12, 2024
520ca38
[Hotfix][VLM] Fixing max position embeddings for Pixtral (#8399)
ywang96 Sep 12, 2024
e56bf27
[Bugfix] Fix InternVL2 inference with various num_patches (#8375)
Isotr0py Sep 12, 2024
c6202da
[Model] Support multiple images for qwen-vl (#8247)
alex-jw-brooks Sep 12, 2024
8a23e93
[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instanc…
lnykww Sep 12, 2024
1f0c75a
[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (#8423)
vegaluisjose Sep 12, 2024
f2e263b
[Bugfix] Offline mode fix (#8376)
joerunde Sep 12, 2024
a6c0f36
[multi-step] add flashinfer backend (#7928)
SolitaryThinker Sep 12, 2024
551ce01
[Core] Add engine option to return only deltas or final output (#7381)
njhill Sep 12, 2024
0198772
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427)
alexm-neuralmagic Sep 12, 2024
c163694
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix cac…
ywang96 Sep 12, 2024
b61bd98
[CI/Build] Disable multi-node test for InternVL2 (#8428)
ywang96 Sep 12, 2024
d31174a
[Hotfix][Pixtral] Fix multiple images bugs (#8415)
patrickvonplaten Sep 12, 2024
a480939
[Bugfix] Fix weight loading issue by rename variable. (#8293)
wenxcs Sep 12, 2024
360ddbd
[Misc] Update Pixtral example (#8431)
ywang96 Sep 13, 2024
8f44a92
[BugFix] fix group_topk (#8430)
dsikka Sep 13, 2024
5ec9c0f
[Core] Factor out input preprocessing to a separate class (#7329)
DarkLight1337 Sep 13, 2024
40c3965
[Bugfix] Mapping physical device indices for e2e test utils (#8290)
ShangmingCai Sep 13, 2024
3f79bc3
[Bugfix] Bump fastapi and pydantic version (#8435)
DarkLight1337 Sep 13, 2024
8427550
[CI/Build] Update pixtral tests to use JSON (#8436)
DarkLight1337 Sep 13, 2024
6821020
[Bugfix] Fix async log stats (#8417)
alexm-neuralmagic Sep 13, 2024
ba77527
[bugfix] torch profiler bug for single gpu with GPUExecutor (#8354)
SolitaryThinker Sep 13, 2024
acda0b3
bump version to v0.6.1.post1 (#8440)
simon-mo Sep 13, 2024
9b4a3b2
[CI/Build] Enable InternVL2 PP test only on single node (#8437)
Isotr0py Sep 13, 2024
cab69a1
[doc] recommend pip instead of conda (#8446)
youkaichao Sep 13, 2024
06311e2
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (#8442)
jeejeelee Sep 13, 2024
a246912
[misc][ci] fix quant test (#8449)
youkaichao Sep 13, 2024
ecd7a1d
[Installation] Gate FastAPI version for Python 3.8 (#8456)
DarkLight1337 Sep 13, 2024
0a4806f
[plugin][torch.compile] allow to add custom compile backend (#8445)
youkaichao Sep 13, 2024
a84e598
[CI/Build] Reorganize models tests (#7820)
DarkLight1337 Sep 13, 2024
f57092c
[Doc] Add oneDNN installation to CPU backend documentation (#8467)
Isotr0py Sep 13, 2024
18e9e1f
[HotFix] Fix final output truncation with stop string + streaming (#8…
njhill Sep 13, 2024
9ba0817
bump version to v0.6.1.post2 (#8473)
simon-mo Sep 13, 2024
8517252
[Hardware][intel GPU] bump up ipex version to 2.3 (#8365)
jikunshang Sep 13, 2024
1ef0d2e
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310)
charlifu Sep 14, 2024
8a0cf1d
[Model] support minicpm3 (#8297)
SUDA-HLT-ywfang Sep 14, 2024
a36e070
[torch.compile] fix functionalization (#8480)
youkaichao Sep 14, 2024
47790f3
[torch.compile] add a flag to disable custom op (#8488)
youkaichao Sep 14, 2024
50e9ec4
[TPU] Implement multi-step scheduling (#8489)
WoosukKwon Sep 14, 2024
3724d5f
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by upda…
chrisociepa Sep 15, 2024
fc990f9
[Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kern…
Isotr0py Sep 15, 2024
a091e2d
[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
ElizaWszola Sep 16, 2024
837c196
[Frontend] Expose revision arg in OpenAI server (#8501)
lewtun Sep 16, 2024
acd5511
[BugFix] Fix clean shutdown issues (#8492)
njhill Sep 16, 2024
781e3b9
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (#8506)
sasha0552 Sep 16, 2024
5d73ae4
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270)
ProExpertProg Sep 16, 2024
2759a43
[doc] update doc on testing and debugging (#8514)
youkaichao Sep 16, 2024
47f5e03
[Bugfix] Bind api server port before starting engine (#8491)
kevin314 Sep 16, 2024
5478c4b
[perf bench] set timeout to debug hanging (#8516)
simon-mo Sep 16, 2024
5ce45eb
[misc] small qol fixes for release process (#8517)
simon-mo Sep 16, 2024
cca6164
[Bugfix] Fix 3.12 builds on main (#8510)
joerunde Sep 17, 2024
546034b
[refactor] remove triton based sampler (#8524)
simon-mo Sep 17, 2024
1c1bb38
[Frontend] Improve Nullable kv Arg Parsing (#8525)
alex-jw-brooks Sep 17, 2024
ee2bcea
[Misc][Bugfix] Disable guided decoding for mistral tokenizer (#8521)
ywang96 Sep 17, 2024
99aa4ed
[torch.compile] register allreduce operations as custom ops (#8526)
youkaichao Sep 17, 2024
cbdb252
[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change …
ruisearch42 Sep 17, 2024
1b6de83
[Benchmark] Support sample from HF datasets and image input for bench…
Isotr0py Sep 17, 2024
1009e93
[Encoder decoder] Add cuda graph support during decoding for encoder-…
sroy745 Sep 17, 2024
9855b99
[Feature][kernel] tensor parallelism with bitsandbytes quantization (…
chenqianfzh Sep 17, 2024
a54ed80
[Model] Add mistral function calling format to all models loaded with…
patrickvonplaten Sep 17, 2024
56c3de0
[Misc] Don't dump contents of kvcache tensors on errors (#8527)
njhill Sep 17, 2024
98f9713
[Bugfix] Fix TP > 1 for new granite (#8544)
joerunde Sep 17, 2024
fa0c114
[doc] improve installation doc (#8550)
youkaichao Sep 17, 2024
09deb47
[CI/Build] Excluding kernels/test_gguf.py from ROCm (#8520)
alexeykondrat Sep 17, 2024
8110e44
[Kernel] Change interface to Mamba causal_conv1d_update for continuou…
tlrmchlsmth Sep 17, 2024
95965d3
[CI/Build] fix Dockerfile.cpu on podman (#8540)
dtrifiro Sep 18, 2024
e351572
[Misc] Add argument to disable FastAPI docs (#8554)
Jeffwan Sep 18, 2024
6ffa3f3
[CI/Build] Avoid CUDA initialization (#8534)
DarkLight1337 Sep 18, 2024
9d104b5
[CI/Build] Update Ruff version (#8469)
aarnphm Sep 18, 2024
7c7714d
[Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (#…
alexm-neuralmagic Sep 18, 2024
a8c1d16
[Core] *Prompt* logprobs support in Multi-step (#8199)
afeldman-nm Sep 18, 2024
d65798f
[Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543)
russellb Sep 18, 2024
e18749f
[Model] Support Solar Model (#8386)
shing100 Sep 18, 2024
b3195bc
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380)
gshtras Sep 18, 2024
db9120c
[Kernel] Change interface to Mamba selective_state_update for continu…
tlrmchlsmth Sep 18, 2024
d9cd78e
[BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572)
njhill Sep 18, 2024
0d47bf3
[Bugfix] add `dead_error` property to engine client (#8574)
joerunde Sep 18, 2024
4c34ce8
[Kernel] Remove marlin moe templating on thread_m_blocks (#8573)
tlrmchlsmth Sep 19, 2024
3118f63
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata const…
sroy745 Sep 19, 2024
02c9afa
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer"…
ywang96 Sep 19, 2024
[CI/Build] Avoid CUDA initialization (vllm-project#8534)
DarkLight1337 authored Sep 18, 2024
commit 6ffa3f314c59e42238f1c5f923ff2839e0af9698
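
All of the file changes below come from this single commit: repeated ad-hoc seeding blocks (random.seed(seed), torch.random.manual_seed(seed), and a conditional torch.cuda.manual_seed(seed)) are replaced with one seed_everything(seed) call imported from vllm.utils, so the benchmarks and kernel tests no longer initialize CUDA as a side effect of seeding. The helper itself is not part of this diff; a minimal sketch of what such a helper could look like (an illustrative assumption, not the actual vllm.utils implementation) is:

import random
from typing import Optional

import numpy as np
import torch


def seed_everything(seed: Optional[int] = None) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs from a single entry point."""
    if seed is None:
        return
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Only touch the CUDA generators when a GPU is actually available, so a
    # CPU-only run never creates a CUDA context just to set a seed.
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

Call sites then shrink to a single seed_everything(seed) line at the top of each benchmark main() or test body, as the hunks below show.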
9 changes: 3 additions & 6 deletions benchmarks/kernels/benchmark_layernorm.py
@@ -1,10 +1,10 @@
import random
import time

import torch

from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser,
seed_everything)


@torch.inference_mode()
@@ -16,10 +16,7 @@ def main(num_tokens: int,
do_profile: bool = False,
num_warmup_iters: int = 5,
num_iters: int = 100) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device("cuda")

layer = RMSNorm(hidden_size).to(dtype=dtype)
6 changes: 3 additions & 3 deletions benchmarks/kernels/benchmark_moe.py
@@ -10,7 +10,7 @@
from transformers import AutoConfig

from vllm.model_executor.layers.fused_moe.fused_moe import *
from vllm.utils import FlexibleArgumentParser
from vllm.utils import FlexibleArgumentParser, seed_everything


class BenchmarkConfig(TypedDict):
@@ -166,7 +166,7 @@ class BenchmarkWorker:

def __init__(self, seed: int) -> None:
torch.set_default_device("cuda")
torch.cuda.manual_seed_all(seed)
seed_everything(seed)
self.seed = seed

def benchmark(
@@ -180,7 +180,7 @@ def benchmark(
use_fp8_w8a8: bool,
use_int8_w8a16: bool,
) -> Tuple[Dict[str, int], float]:
torch.cuda.manual_seed_all(self.seed)
seed_everything(self.seed)
dtype_str = get_config_dtype_str(dtype,
use_int8_w8a16=use_int8_w8a16,
use_fp8_w8a8=use_fp8_w8a8)
7 changes: 2 additions & 5 deletions benchmarks/kernels/benchmark_paged_attention.py
@@ -6,7 +6,7 @@

from vllm import _custom_ops as ops
from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser,
create_kv_caches_with_random)
create_kv_caches_with_random, seed_everything)

NUM_BLOCKS = 1024
PARTITION_SIZE = 512
@@ -28,10 +28,7 @@ def main(
device: str = "cuda",
kv_cache_dtype: Optional[str] = None,
) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)

scale = float(1.0 / (head_size**0.5))
query = torch.empty(num_seqs,
9 changes: 3 additions & 6 deletions benchmarks/kernels/benchmark_quant.py
@@ -1,10 +1,10 @@
import random
import time

import torch

from vllm import _custom_ops as ops
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser,
seed_everything)


@torch.inference_mode()
@@ -17,10 +17,7 @@ def main(num_tokens: int,
do_profile: bool = False,
num_warmup_iters: int = 5,
num_iters: int = 100) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device("cuda")

x = torch.randn(num_tokens, hidden_size, dtype=dtype)
6 changes: 2 additions & 4 deletions benchmarks/kernels/benchmark_rope.py
@@ -6,7 +6,7 @@

from vllm.model_executor.layers.rotary_embedding import (RotaryEmbedding,
get_rope)
from vllm.utils import FlexibleArgumentParser
from vllm.utils import FlexibleArgumentParser, seed_everything


def benchmark_rope_kernels_multi_lora(
@@ -22,9 +22,7 @@ def benchmark_rope_kernels_multi_lora(
max_position: int = 8192,
base: int = 10000,
) -> None:
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)
if rotary_dim is None:
rotary_dim = head_size
9 changes: 3 additions & 6 deletions tests/kernels/test_activation.py
@@ -7,6 +7,7 @@
from vllm.model_executor.layers.activation import (FastGELU, GeluAndMul,
NewGELU, QuickGELU,
SiluAndMul)
from vllm.utils import seed_everything

from .allclose_default import get_default_atol, get_default_rtol

@@ -34,9 +35,7 @@ def test_act_and_mul(
seed: int,
device: str,
) -> None:
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)
x = torch.randn(num_tokens, 2 * d, dtype=dtype)
if activation == "silu":
@@ -77,9 +76,7 @@ def test_activation(
seed: int,
device: str,
) -> None:
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)
x = torch.randn(num_tokens, d, dtype=dtype)
layer = activation[0]()
18 changes: 5 additions & 13 deletions tests/kernels/test_attention.py
@@ -6,7 +6,7 @@

from tests.kernels.utils import opcheck
from vllm import _custom_ops as ops
from vllm.utils import get_max_shared_memory_bytes, is_hip
from vllm.utils import get_max_shared_memory_bytes, is_hip, seed_everything

from .allclose_default import get_default_atol, get_default_rtol

@@ -139,10 +139,8 @@ def test_paged_attention(
) -> None:
if kv_cache_dtype == "fp8" and head_size % 16:
pytest.skip()
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)

seed_everything(seed)
torch.set_default_device(device)
scale = float(1.0 / (head_size**0.5))
num_query_heads, num_kv_heads = num_heads
@@ -354,10 +352,7 @@ def test_paged_attention_rocm(
seed: int,
device: str,
) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)
scale = float(1.0 / (head_size**0.5))
num_query_heads, num_kv_heads = num_heads
@@ -506,10 +501,7 @@ def test_multi_query_kv_attention(
seed: int,
device: str,
) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)
# MAX_SEQ_LEN sometimes causes OOM in the reference implementation.
# As the xformers library is already tested with its own tests, we can use
2 changes: 1 addition & 1 deletion tests/kernels/test_attention_selector.py
@@ -45,7 +45,7 @@ def test_flash_attn(monkeypatch):
override_backend_env_variable(monkeypatch, STR_FLASH_ATTN_VAL)

# Unsupported CUDA arch
with patch("torch.cuda.get_device_capability", return_value=[7, 5]):
with patch("torch.cuda.get_device_capability", return_value=(7, 5)):
backend = which_attn_to_use(8, 16, 8, None, torch.float16, None, 16)
assert backend.name != STR_FLASH_ATTN_VAL

5 changes: 3 additions & 2 deletions tests/kernels/test_awq_triton.py
@@ -7,6 +7,7 @@

from vllm.model_executor.layers.quantization.awq_triton import (
AWQ_TRITON_SUPPORTED_GROUP_SIZES, awq_dequantize_triton, awq_gemm_triton)
from vllm.utils import seed_everything

device = "cuda"

@@ -79,7 +80,7 @@ def test_dequantize(qweight_rows, qweight_cols, group_size):
zeros_cols = qweight_cols
zeros_dtype = torch.int32

torch.manual_seed(0)
seed_everything(0)

qweight = torch.randint(0,
torch.iinfo(torch.int32).max,
@@ -133,7 +134,7 @@ def test_gemm(N, K, M, splitK, group_size):
qzeros_rows = scales_rows
qzeros_cols = qweight_cols

torch.manual_seed(0)
seed_everything(0)

input = torch.rand((input_rows, input_cols),
dtype=input_dtype,
12 changes: 3 additions & 9 deletions tests/kernels/test_blocksparse_attention.py
@@ -7,7 +7,7 @@
from vllm import _custom_ops as ops
from vllm.attention.ops.blocksparse_attention.interface import (
LocalStridedBlockSparseAttn)
from vllm.utils import get_max_shared_memory_bytes, is_hip
from vllm.utils import get_max_shared_memory_bytes, is_hip, seed_everything

from .allclose_default import get_default_atol, get_default_rtol

@@ -172,10 +172,7 @@ def test_paged_attention(
blocksparse_block_size: int,
blocksparse_head_sliding_step: int,
) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)
scale = float(1.0 / (head_size**0.5))
num_query_heads, num_kv_heads = num_heads
@@ -386,10 +383,7 @@ def test_varlen_blocksparse_attention_prefill(
seed: int,
device: str,
) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)
# MAX_SEQ_LEN sometimes causes OOM in the reference implementation.
# As the xformers library is already tested with its own tests, we can use
25 changes: 7 additions & 18 deletions tests/kernels/test_cache.py
@@ -6,6 +6,7 @@

from tests.kernels.utils import DEFAULT_OPCHECK_TEST_UTILS, opcheck
from vllm import _custom_ops as ops
from vllm.utils import seed_everything

COPYING_DIRECTION = [('cuda', 'cpu'), ('cuda', 'cuda'), ('cpu', 'cuda')]
DTYPES = [torch.half, torch.bfloat16, torch.float]
@@ -55,10 +56,7 @@ def test_copy_blocks(
) -> None:
if kv_cache_dtype == "fp8" and head_size % 16:
pytest.skip()
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)
# Generate random block mappings where each source block is mapped to two
# destination blocks.
@@ -134,10 +132,7 @@ def test_reshape_and_cache(
) -> None:
if kv_cache_dtype == "fp8" and head_size % 16:
pytest.skip()
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)
# Create a random slot mapping.
num_slots = block_size * num_blocks
@@ -229,9 +224,7 @@ def test_reshape_and_cache_flash(
device: str,
kv_cache_dtype: str,
) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)
seed_everything(seed)
torch.set_default_device(device)

# Create a random slot mapping.
@@ -345,10 +338,8 @@ def test_swap_blocks(
pytest.skip()
if kv_cache_dtype == "fp8" and head_size % 16:
pytest.skip()
random.seed(seed)
torch.random.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)

seed_everything(seed)

src_device = device if direction[0] == "cuda" else 'cpu'
dst_device = device if direction[1] == "cuda" else 'cpu'
@@ -417,9 +408,7 @@ def test_fp8_e4m3_conversion(
seed: int,
device: str,
) -> None:
random.seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)
seed_everything(seed)

low = -224.0
high = 224.0
5 changes: 3 additions & 2 deletions tests/kernels/test_causal_conv1d.py
@@ -7,6 +7,7 @@

from vllm.model_executor.layers.mamba.ops.causal_conv1d import (
causal_conv1d_fn, causal_conv1d_update)
from vllm.utils import seed_everything


def causal_conv1d_ref(
@@ -104,7 +105,7 @@ def test_causal_conv1d(batch, dim, seqlen, width, has_bias, silu_activation,
if itype == torch.bfloat16:
rtol, atol = 1e-2, 5e-2
# set seed
torch.random.manual_seed(0)
seed_everything(0)
if not channel_last:
x = torch.randn(batch,
4096 + dim + 64,
@@ -175,7 +176,7 @@ def test_causal_conv1d_update(batch, dim, width, has_bias, silu_activation,
if itype == torch.bfloat16:
rtol, atol = 1e-2, 5e-2
# set seed
torch.random.manual_seed(0)
seed_everything(0)
batch = 2
x = torch.randn(batch, dim, device=device, dtype=itype)
conv_state = torch.randn(batch, dim, width, device=device, dtype=itype)