This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 04 08 #173

Merged: 292 commits, Apr 10, 2024

Changes shown are from 250 of the 292 commits.

Commits (292)
baee28c
Reorder kv dtype check to avoid nvcc not found error on AMD platform …
cloudhan Mar 2, 2024
ce4f5a2
Add Automatic Prefix Caching (#2762)
SageMoore Mar 2, 2024
d65fac2
Add vLLM version info to logs and openai API server (#3161)
jasonacox Mar 3, 2024
996d095
[FIX] Fix styles in automatic prefix caching & add a automatic prefix…
zhuohan123 Mar 3, 2024
17c3103
Make it easy to profile workers with nsight (#3162)
pcmoritz Mar 4, 2024
d0fae88
[DOC] add setup document to support neuron backend (#2777)
liangfu Mar 4, 2024
901cf4c
[Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171)
gty111 Mar 4, 2024
27a7b07
Add document for vllm paged attention kernel. (#2978)
pian13131 Mar 4, 2024
9cbc7e5
enable --gpu-memory-utilization in benchmark_throughput.py (#3175)
AllenDou Mar 4, 2024
76e8a70
[Minor fix] The domain dns.google may cause a socket.gaierror excepti…
ttbachyinsda Mar 4, 2024
22de452
Push logprob generation to LLMEngine (#3065)
Yard1 Mar 4, 2024
ff578ca
Add health check, make async Engine more robust (#3015)
Yard1 Mar 4, 2024
9a4548b
Fix the openai benchmarking requests to work with latest OpenAI apis …
wangchen615 Mar 4, 2024
05af6da
[ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#…
hongxiayang Mar 5, 2024
8999ec3
Store `eos_token_id` in `Sequence` for easy access (#3166)
njhill Mar 5, 2024
2efce05
[Fix] Avoid pickling entire LLMEngine for Ray workers (#3207)
njhill Mar 6, 2024
24aecf4
[Tests] Add block manager and scheduler tests (#3108)
rkooo567 Mar 6, 2024
a33ce60
[Testing] Fix core tests (#3224)
cadedaniel Mar 6, 2024
4cb3b92
Add tqdm `dynamic_ncols=True` (#3242)
chujiezheng Mar 6, 2024
d3c04b6
Add GPTQ support for Gemma (#3200)
TechxGenus Mar 7, 2024
cbf4c05
Update requirements-dev.txt to include package for benchmarking scrip…
wangchen615 Mar 7, 2024
2daf23a
Separate attention backends (#3005)
WoosukKwon Mar 7, 2024
385da2d
Measure model memory usage (#3120)
mgoin Mar 7, 2024
8cbba46
Possible fix for conflict between Automated Prefix Caching (#2762) an…
jacobthebanana Mar 7, 2024
b35cc93
Fix auto prefix bug (#3239)
ElizaWszola Mar 8, 2024
d2339d6
Connect engine healthcheck to openai server (#3260)
njhill Mar 8, 2024
c59e120
Feature add lora support for Qwen2 (#3177)
whyiug Mar 8, 2024
1ece1ae
[Minor Fix] Fix comments in benchmark_serving (#3252)
gty111 Mar 8, 2024
99c3cfb
[Docs] Fix Unmocked Imports (#3275)
ywang96 Mar 8, 2024
1cb0cc2
[FIX] Make `flash_attn` optional (#3269)
WoosukKwon Mar 8, 2024
c2c5e09
Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir (#3241)
mgoin Mar 8, 2024
f48c679
[FIX] Fix prefix test error on main (#3286)
zhuohan123 Mar 9, 2024
8437bae
[Speculative decoding 3/9] Worker which speculates, scores, and appli…
cadedaniel Mar 9, 2024
0bba88d
Enhance lora tests with more layer and rank variations (#3243)
tterrysun Mar 10, 2024
e4a28e5
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUD…
dllehr-amd Mar 10, 2024
9e8744a
[BugFix] Fix get tokenizer when using ray (#3301)
esmeetu Mar 11, 2024
4b59f00
[Fix] Fix best_of behavior when n=1 (#3298)
njhill Mar 11, 2024
2f8844b
Re-enable the 80 char line width limit (#3305)
zhuohan123 Mar 11, 2024
657061f
[docs] Add LoRA support information for models (#3299)
pcmoritz Mar 11, 2024
4c92270
Add distributed model executor abstraction (#3191)
zhuohan123 Mar 11, 2024
c9415c1
[ROCm] Fix warp and lane calculation in blockReduceSum (#3321)
kliuae Mar 11, 2024
654865e
Support Mistral Model Inference with transformers-neuronx (#3153)
DAIZHENWEI Mar 11, 2024
b0925b3
docs: Add BentoML deployment doc (#3336)
Sherlock113 Mar 12, 2024
49a3c86
Fixes #1556 double free (#3347)
br3no Mar 13, 2024
602358f
Add kernel for GeGLU with approximate GELU (#3337)
WoosukKwon Mar 13, 2024
b167109
[Fix] Fix quantization="gptq" when using Marlin (#3319)
DreamTeamWangbowen Mar 13, 2024
e221910
add hf_transfer to requirements.txt (#3031)
RonanKMcGovern Mar 13, 2024
ba8dc95
[Minor] Fix bias in if to remove ambiguity (#3259)
hliuca Mar 13, 2024
739c350
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256)
chenxu2048 Mar 13, 2024
ae0ccb4
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism…
orsharir Mar 13, 2024
7e9bd08
Add batched RoPE kernel (#3095)
tterrysun Mar 13, 2024
c33afd8
Fix lint (#3388)
Yard1 Mar 13, 2024
eeab52a
[FIX] Simpler fix for async engine running on ray (#3371)
zhuohan123 Mar 13, 2024
81653d9
[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion …
simon-mo Mar 14, 2024
a37415c
allow user to chose which vllm's merics to display in grafana (#3393)
AllenDou Mar 14, 2024
8fe8386
[Kernel] change benchmark script so that result can be directly used;…
youkaichao Mar 14, 2024
06ec486
Install `flash_attn` in Docker image (#3396)
tdoublep Mar 14, 2024
c17ca8e
Add args for mTLS support (#3410)
declark1 Mar 14, 2024
dfc7740
[issue templates] add some issue templates (#3412)
youkaichao Mar 14, 2024
54be8a0
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
chenxu2048 Mar 14, 2024
b983ba3
fix marlin config repr (#3414)
qeternity Mar 14, 2024
78b6c48
Dynamically configure shared memory size for moe_align_block_size_ker…
akhoroshev Mar 15, 2024
b522c44
[Misc] add HOST_IP env var (#3419)
youkaichao Mar 15, 2024
21539e6
Add chat templates for Falcon (#3420)
Dinghow Mar 15, 2024
253a980
Add chat templates for ChatGLM (#3418)
Dinghow Mar 15, 2024
429284d
Fix `dist.broadcast` stall without group argument (#3408)
GindaChen Mar 15, 2024
a7c8716
Fix tie_word_embeddings for Qwen2. (#3344)
fyabc Mar 15, 2024
03d37f2
[Fix] Add args for mTLS support (#3430)
declark1 Mar 15, 2024
14b8ae0
Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220)
sighingnow Mar 15, 2024
604f235
[Misc] add error message in non linux platform (#3438)
youkaichao Mar 15, 2024
a7af453
Fix issue templates (#3436)
hmellor Mar 15, 2024
8fa7357
fix document error for value and v_vec illustration (#3421)
laneeeee Mar 15, 2024
fb96c1e
Asynchronous tokenization (#2879)
Yard1 Mar 15, 2024
10585e0
Removed Extraneous Print Message From OAI Server (#3440)
robertgshaw2-neuralmagic Mar 16, 2024
413366e
[Misc] PR templates (#3413)
youkaichao Mar 16, 2024
3123f15
Fixes the incorrect argument in the prefix-prefill test cases (#3246)
sighingnow Mar 16, 2024
14e3f9a
Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (…
ronensc Mar 16, 2024
cf6ff18
Fix Baichuan chat template (#3340)
Dinghow Mar 16, 2024
ad50bf4
fix lint
simon-mo Mar 16, 2024
8e67598
[Misc] fix line length for entire codebase (#3444)
simon-mo Mar 16, 2024
120157f
Support arbitrary json_object in OpenAI and Context Free Grammar (#3211)
simon-mo Mar 16, 2024
6b78837
Fix setup.py neuron-ls issue (#2671)
simon-mo Mar 16, 2024
abfc4f3
[Misc] Use dataclass for InputMetadata (#3452)
WoosukKwon Mar 17, 2024
93348d9
[CI] Shard tests for LoRA and Kernels to speed up (#3445)
simon-mo Mar 17, 2024
9101d83
[Bugfix] Make moe_align_block_size AMD-compatible (#3470)
WoosukKwon Mar 18, 2024
8c654c0
CI: Add ROCm Docker Build (#2886)
simon-mo Mar 18, 2024
482b0ad
[Testing] Add test_config.py to CI (#3437)
cadedaniel Mar 18, 2024
097aa0e
[CI/Build] Fix Bad Import In Test (#3473)
robertgshaw2-neuralmagic Mar 18, 2024
c0c17d4
[Misc] Fix PR Template (#3478)
zhuohan123 Mar 18, 2024
9fdf3de
Cmake based build system (#2830)
bnellnm Mar 18, 2024
49eedea
[Core] Zero-copy asdict for InputMetadata (#3475)
Yard1 Mar 18, 2024
b30880a
[Misc] Update README for the Third vLLM Meetup (#3479)
zhuohan123 Mar 18, 2024
b37cdce
[Core] Cache some utils (#3474)
Yard1 Mar 19, 2024
6a9c583
[Core] print error before deadlock (#3459)
youkaichao Mar 19, 2024
ef65dcf
[Doc] Add docs about OpenAI compatible server (#3288)
simon-mo Mar 19, 2024
7341c77
[BugFix] Avoid initializing CUDA too early (#3487)
njhill Mar 19, 2024
c614cfe
Update dockerfile with ModelScope support (#3429)
ifsheldon Mar 19, 2024
2a60c9b
[Doc] minor fix to neuron-installation.rst (#3505)
jimburtoft Mar 19, 2024
cc63d03
Revert "[Core] Cache some utils" (#3507)
simon-mo Mar 19, 2024
63e8b28
[Doc] minor fix of spelling in amd-installation.rst (#3506)
jimburtoft Mar 19, 2024
20478c4
Use lru_cache for some environment detection utils (#3508)
simon-mo Mar 19, 2024
9474e89
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator perfor…
ElizaWszola Mar 20, 2024
4ad521d
[Core] Add generic typing to `LRUCache` (#3511)
njhill Mar 20, 2024
5ee1449
[Misc] Remove cache stream and cache events (#3461)
WoosukKwon Mar 20, 2024
84eaa68
Abort when nvcc command is not found in the PATH (#3527)
AllenDou Mar 20, 2024
ba8ae1d
Check for _is_cuda() in compute_num_jobs (#3481)
bnellnm Mar 20, 2024
80e2548
[Bugfix] Fix ROCm support in CMakeLists.txt (#3534)
jamestwhedbee Mar 20, 2024
426ec4e
[1/n] Triton sampling kernel (#3186)
Yard1 Mar 20, 2024
6e435de
[1/n][Chunked Prefill] Refactor input query shapes (#3236)
rkooo567 Mar 20, 2024
f1c0fc3
Migrate `logits` computation and gather to `model_runner` (#3233)
esmeetu Mar 20, 2024
523e30e
[BugFix] Hot fix in setup.py for neuron build (#3537)
zhuohan123 Mar 21, 2024
6ebd02b
[PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (#3431)
ElizaWszola Mar 21, 2024
3bbff9e
Fix 1D query issue from `_prune_hidden_states` (#3539)
rkooo567 Mar 21, 2024
4c07dd2
[🚀 Ready to be merged] Added support for Jais models (#3183)
grandiose-pizza Mar 21, 2024
8657323
[Misc][Log] Add log for tokenizer length not equal to vocabulary size…
esmeetu Mar 21, 2024
c188ecb
[Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (#3551)
WoosukKwon Mar 21, 2024
b7050ca
[BugFix] gemma loading after quantization or LoRA. (#3553)
taeminlee Mar 21, 2024
ea5f14e
[Bugfix][Model] Fix Qwen2 (#3554)
esmeetu Mar 22, 2024
e90fc21
[Hardware][Neuron] Refactor neuron support (#3471)
zhuohan123 Mar 22, 2024
f721096
[BugFix] Some fixes for custom allreduce kernels (#2760)
hanzhi713 Mar 22, 2024
cf2f084
Dynamic scheduler delay to improve ITL performance (#3279)
tdoublep Mar 22, 2024
bfdb1ba
[Core] Improve detokenization performance for prefill (#3469)
Yard1 Mar 22, 2024
743a0b7
[Bugfix] use SoftLockFile instead of LockFile (#3578)
kota-iizuka Mar 23, 2024
3c5ab9b
[Misc] Fix BLOOM copyright notice (#3591)
WoosukKwon Mar 24, 2024
f8a12ec
[Misc] Bump transformers version (#3592)
ywang96 Mar 24, 2024
af9e534
[BugFix] Fix Falcon tied embeddings (#3590)
WoosukKwon Mar 24, 2024
41deac4
[BugFix] 1D query fix for MoE models (#3597)
njhill Mar 24, 2024
8b268a4
[CI] typo fix: is_hip --> is_hip() (#3595)
youkaichao Mar 24, 2024
42bc386
[CI/Build] respect the common environment variable MAX_JOBS (#3600)
youkaichao Mar 25, 2024
837e185
[CI/Build] fix flaky test (#3602)
youkaichao Mar 25, 2024
6d93d35
[BugFix] tensor.get_device() -> tensor.device (#3604)
jikunshang Mar 25, 2024
56a8652
[Bugfix] store lock file in tmp directory (#3578)" (#3599)
WoosukKwon Mar 25, 2024
b0dfa91
[Model] Add starcoder2 awq support (#3569)
shaonianyr Mar 25, 2024
925f333
[Core] Refactor Attention Take 2 (#3462)
WoosukKwon Mar 25, 2024
e67c295
[Bugfix] fix automatic prefix args and add log info (#3608)
gty111 Mar 25, 2024
01bfb22
[CI] Try introducing isort. (#3495)
rkooo567 Mar 25, 2024
819924e
[Core] Adding token ranks along with logprobs (#3516)
SwapnilDreams100 Mar 25, 2024
c13ad1b
feat: implement the min_tokens sampling parameter (#3124)
tjohnson31415 Mar 25, 2024
0b4997e
[Bugfix] API stream returning two stops (#3450)
dylanwhawk Mar 25, 2024
f408d05
hotfix isort on logprobs ranks pr (#3622)
simon-mo Mar 25, 2024
64172a9
[Feature] Add vision language model support. (#3042)
xwjiang2010 Mar 25, 2024
3a24309
Optimize `_get_ranks` in Sampler (#3623)
Yard1 Mar 25, 2024
dfeb2ec
[Misc] Include matched stop string/token in responses (#2976)
njhill Mar 26, 2024
8af890a
Enable more models to inference based on LoRA (#3382)
jeejeelee Mar 26, 2024
a979d97
[Bugfix] Fix ipv6 address parsing bug (#3641)
liiliiliil Mar 26, 2024
0dc7227
[BugFix] Fix ipv4 address parsing regression (#3645)
njhill Mar 26, 2024
566b57c
[Kernel] support non-zero cuda devices in punica kernels (#3636)
jeejeelee Mar 27, 2024
7687934
[Doc]add lora support (#3649)
jeejeelee Mar 27, 2024
e66b629
[Misc] Minor fix in KVCache type (#3652)
WoosukKwon Mar 27, 2024
8f44fac
[Core] remove cupy dependency (#3625)
youkaichao Mar 27, 2024
82c540b
[Bugfix] More faithful implementation of Gemma (#3653)
WoosukKwon Mar 27, 2024
d18f4e7
[Bugfix] [Hotfix] fix nccl library name (#3661)
youkaichao Mar 27, 2024
e24336b
[Model] Add support for DBRX (#3660)
megha95 Mar 27, 2024
1956931
[Misc] add the "download-dir" option to the latency/throughput benchm…
AmadeusChan Mar 27, 2024
45b6ef6
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (…
ywang96 Mar 27, 2024
1182607
Add support for Cohere's Command-R model (#3433)
zeppombal Mar 27, 2024
6d9aa00
[Docs] Add Command-R to supported models (#3669)
WoosukKwon Mar 27, 2024
10e6322
[Model] Fix and clean commandr (#3671)
esmeetu Mar 28, 2024
098e177
[Model] Add support for xverse (#3610)
hxer7963 Mar 28, 2024
3492859
[CI/Build] update default number of jobs and nvcc threads to avoid ov…
youkaichao Mar 28, 2024
8267b06
[Kernel] Add Triton MoE kernel configs for DBRX on A100 (#3679)
WoosukKwon Mar 28, 2024
14ccd94
[Core][Bugfix]Refactor block manager for better testability (#3492)
cadedaniel Mar 28, 2024
d6ea427
[Model] Add support for Qwen2MoeModel (#3346)
wenyujin333 Mar 28, 2024
ce567a2
[Kernel] DBRX Triton MoE kernel H100 (#3692)
ywang96 Mar 28, 2024
b51c1cc
[2/N] Chunked prefill data update (#3538)
rkooo567 Mar 28, 2024
1715056
[Bugfix] Update neuron_executor.py to add optional vision_language_co…
adamrb Mar 28, 2024
96aa014
fix benchmark format reporting in buildkite (#3693)
simon-mo Mar 28, 2024
a4075cb
[CI] Add test case to run examples scripts (#3638)
simon-mo Mar 28, 2024
515386e
[Core] Support multi-node inference(eager and cuda graph) (#3686)
esmeetu Mar 28, 2024
cb40b3a
[Kernel] Add MoE Triton kernel configs for A100 40GB (#3700)
WoosukKwon Mar 28, 2024
c0935c9
[Bugfix] Set enable_prefix_caching=True in prefix caching example (#3…
WoosukKwon Mar 28, 2024
4716a32
fix logging msg for block manager (#3701)
simon-mo Mar 28, 2024
0267fef
[Core] fix del of communicator (#3702)
youkaichao Mar 29, 2024
98a42e7
[Benchmark] Change mii to use persistent deployment and support tenso…
IKACE Mar 29, 2024
27a57ca
bump version to v0.4.0 (#3705)
simon-mo Mar 29, 2024
f342153
Revert "bump version to v0.4.0" (#3708)
youkaichao Mar 29, 2024
26422e4
[Test] Make model tests run again and remove --forked from pytest (#3…
rkooo567 Mar 29, 2024
395aa82
[Misc] Minor type annotation fix (#3716)
WoosukKwon Mar 29, 2024
756b30a
[Core][Test] move local_rank to the last arg with default value(#3711)
youkaichao Mar 29, 2024
7bc94a0
add ccache to docker build image (#3704)
simon-mo Mar 29, 2024
d8658c8
Usage Stats Collection (#2852)
yhu422 Mar 29, 2024
6110c39
[BugFix] Fix tokenizer out of vocab size (#3685)
esmeetu Mar 29, 2024
f510395
[BugFix][Frontend] Fix completion logprobs=0 error (#3731)
esmeetu Mar 29, 2024
97356f3
[Bugfix] Command-R Max Model Length (#3727)
ywang96 Mar 29, 2024
430530f
bump version to v0.4.0 (#3712)
simon-mo Mar 29, 2024
9765b5c
[ROCm][Bugfix] Fixed several bugs related to rccl path and attention …
hongxiayang Mar 29, 2024
8b2d3cb
usage lib get version another way (#3735)
simon-mo Mar 29, 2024
991143c
[BugFix] Use consistent logger everywhere (#3738)
njhill Mar 29, 2024
203d4f8
[Core][Bugfix] cache len of tokenizer (#3741)
youkaichao Mar 30, 2024
3ad438c
Fix build when nvtools is missing (#3698)
bnellnm Mar 30, 2024
51c31bc
CMake build elf without PTX (#3739)
simon-mo Mar 30, 2024
b6d1035
[Kernel] Layernorm performance optimization (#3662)
mawong-amd Mar 30, 2024
9c82a1b
[Doc] Update installation doc (#3746)
youkaichao Mar 30, 2024
563c1d7
[CI/Build] Make Marlin Tests Green (#3753)
robertgshaw2-neuralmagic Mar 31, 2024
f03cc66
[Misc] Minor fixes in requirements.txt (#3769)
WoosukKwon Apr 1, 2024
49782fc
[Misc] Some minor simplifications to detokenization logic (#3670)
njhill Apr 1, 2024
ccb58b2
[Misc] Fix Benchmark TTFT Calculation for Chat Completions (#3768)
ywang96 Apr 1, 2024
93deb0b
[Speculative decoding 4/9] Lookahead scheduling for speculative decod…
cadedaniel Apr 1, 2024
7d4e1b8
[Misc] Add support for new autogptq checkpoint_format (#3689)
Qubitium Apr 1, 2024
eb69d68
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by …
cadedaniel Apr 2, 2024
0e3f06f
[Hardware][Intel] Add CPU inference backend (#3634)
bigPYJ1151 Apr 2, 2024
77a6572
[HotFix] [CI/Build] Minor fix for CPU backend CI (#3787)
bigPYJ1151 Apr 2, 2024
0739b19
[Frontend][Bugfix] allow using the default middleware with a root pat…
A-Mahla Apr 2, 2024
3bec41f
[Doc] Fix vLLMEngine Doc Page (#3791)
ywang96 Apr 2, 2024
205b949
[CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build (#3801)
youkaichao Apr 2, 2024
ad6eca4
Fix early CUDA init via get_architecture_class_name import (#3770)
leiwen83 Apr 2, 2024
b321d48
[Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spe…
mgoin Apr 2, 2024
a3c226e
[CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary (#3803)
youkaichao Apr 2, 2024
5757d90
[Speculative decoding] Adding configuration object for speculative de…
cadedaniel Apr 3, 2024
c9b506d
[BugFix] Use different mechanism to get vllm version in `is_cpu()` (#…
njhill Apr 3, 2024
76b889b
[Doc] Update README.md (#3806)
robertgshaw2-neuralmagic Apr 3, 2024
c64cf38
[Doc] Update contribution guidelines for better onboarding (#3819)
michaelfeil Apr 3, 2024
3dcb3e8
[3/N] Refactor scheduler for chunked prefill scheduling (#3550)
rkooo567 Apr 3, 2024
2ff767b
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
AdrianAbeyta Apr 3, 2024
b95047f
[Misc] Publish 3rd meetup slides (#3835)
WoosukKwon Apr 3, 2024
294f8f6
[BugFix] Pass tokenizer_config to local_tokenizer_group (#3754)
sighingnow Apr 4, 2024
537ee25
[Core] Enable hf_transfer by default if available (#3817)
michaelfeil Apr 4, 2024
498eb5c
[Bugfix] Add kv_scale input parameter to CPU backend (#3840)
WoosukKwon Apr 4, 2024
aabe8f4
[Core] [Frontend] Make detokenization optional (#3749)
mgerstgrasser Apr 4, 2024
819a309
[Bugfix] Fix args in benchmark_serving (#3836)
CatherineSue Apr 4, 2024
b778200
[Benchmark] Refactor sample_requests in benchmark_throughput (#3613)
gty111 Apr 4, 2024
ca81ff5
[Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805)
youkaichao Apr 4, 2024
db2a6a4
[Hardware][CPU] Update cpu torch to match default of 2.2.1 (#3854)
mgoin Apr 4, 2024
9117f89
[Model] Cohere CommandR+ (#3829)
saurabhdash2512 Apr 4, 2024
c391e4b
[Core] improve robustness of pynccl (#3860)
youkaichao Apr 4, 2024
78107fa
[Doc]Add asynchronous engine arguments to documentation. (#3810)
SeanGallen Apr 5, 2024
d03d64f
[CI/Build] refactor dockerfile & fix pip cache
youkaichao Apr 5, 2024
e5043a3
[Misc] Add pytest marker to opt-out of global test cleanup (#3863)
cadedaniel Apr 5, 2024
e0dd4d3
[Misc] Fix linter issues in examples/fp8/quantizer/quantize.py (#3864)
cadedaniel Apr 5, 2024
9edec65
[Bugfix] Fixing requirements.txt (#3865)
noamgat Apr 5, 2024
cfaf49a
[Misc] Define common requirements (#3841)
WoosukKwon Apr 5, 2024
1d7c940
Add option to completion API to truncate prompt tokens (#3144)
tdoublep Apr 5, 2024
18de883
[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853)
rkooo567 Apr 5, 2024
54951ac
[Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism (#…
Isotr0py Apr 5, 2024
e4be7d7
[CI/Benchmark] add more iteration and use median for robust latency b…
youkaichao Apr 6, 2024
95baec8
[Core] enable out-of-tree model register (#3871)
youkaichao Apr 7, 2024
2f19283
[Core] latency optimization (#3890)
youkaichao Apr 7, 2024
0ce0539
[Bugfix] Fix Llava inference with Tensor Parallelism. (#3883)
Isotr0py Apr 7, 2024
b4543c8
[Model] add minicpm (#3893)
SUDA-HLT-ywfang Apr 8, 2024
52d61ba
init
SageMoore Apr 8, 2024
ae82a4e
format
SageMoore Apr 8, 2024
a4c891a
update dockerfile
SageMoore Apr 8, 2024
d3a00f9
update github actions
SageMoore Apr 8, 2024
ae21cef
update xformers
SageMoore Apr 9, 2024
aab3f5b
fix scheduler tests
SageMoore Apr 9, 2024
61ef56c
updated setup.py, collect_env.py, and the docker file to use magic-wa…
SageMoore Apr 9, 2024
33415d0
updated requirements-benchmark.txt to use nm-magic-wand-nightly
SageMoore Apr 9, 2024
7020a7c
updated dockerfile to use external magic-wand-nightly
SageMoore Apr 9, 2024
2763ce3
format
SageMoore Apr 9, 2024
d4e55ba
add test_compressed_memory to skipped list
SageMoore Apr 10, 2024
7 changes: 5 additions & 2 deletions .buildkite/test-pipeline.yaml
@@ -34,7 +34,10 @@ steps:
command: pytest -v -s engine tokenization test_sequence.py test_config.py

- label: Entrypoints Test
command: pytest -v -s entrypoints
commands:
# these tests have to be separated, because each one will allocate all possible GPU memory
- pytest -v -s entrypoints --ignore=entrypoints/test_server_oot_registration.py
- pytest -v -s entrypoints/test_server_oot_registration.py

- label: Examples Test
working_dir: "/vllm-workspace/examples"
@@ -90,7 +93,7 @@ steps:
- bash run-benchmarks.sh

- label: Documentation Build
working_dir: "/vllm-workspace/docs"
working_dir: "/vllm-workspace/test_docs/docs"
no_gpu: True
commands:
- pip install -r requirements-docs.txt
2 changes: 1 addition & 1 deletion .github/workflows/publish.yml
@@ -49,7 +49,7 @@ jobs:
matrix:
os: ['ubuntu-20.04']
python-version: ['3.8', '3.9', '3.10', '3.11']
pytorch-version: ['2.1.2'] # Must be the most recent version that meets requirements.txt.
pytorch-version: ['2.2.1'] # Must be the most recent version that meets requirements-cuda.txt.
cuda-version: ['11.8', '12.1']

steps:
2 changes: 1 addition & 1 deletion .github/workflows/scripts/build.sh
@@ -9,7 +9,7 @@ LD_LIBRARY_PATH=${cuda_home}/lib64:$LD_LIBRARY_PATH

# Install requirements
$python_executable -m pip install wheel packaging
$python_executable -m pip install -r requirements.txt
$python_executable -m pip install -r requirements-cuda.txt

# Limit the number of parallel jobs to avoid OOM
export MAX_JOBS=1
1 change: 1 addition & 0 deletions .gitignore
@@ -183,6 +183,7 @@ _build/
# hip files generated by PyTorch
*.hip
*_hip*
hip_compat.h

# Benchmark dataset
*.json
4 changes: 2 additions & 2 deletions CMakeLists.txt
@@ -19,7 +19,7 @@ set(PYTHON_SUPPORTED_VERSIONS "3.8" "3.9" "3.10" "3.11")
set(CUDA_SUPPORTED_ARCHS "7.0;7.5;8.0;8.6;8.9;9.0")

# Supported AMD GPU architectures.
set(HIP_SUPPORTED_ARCHS "gfx908;gfx90a;gfx942;gfx1100")
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100")

#
# Supported/expected torch versions for CUDA/ROCm.
@@ -31,7 +31,7 @@ set(HIP_SUPPORTED_ARCHS "gfx908;gfx90a;gfx942;gfx1100")
# requirements.txt files and should be kept consistent. The ROCm torch
# versions are derived from Dockerfile.rocm
#
set(TORCH_SUPPORTED_VERSION_CUDA "2.1.2")
set(TORCH_SUPPORTED_VERSION_CUDA "2.2.1")
set(TORCH_SUPPORTED_VERSION_ROCM_5X "2.0.1")
set(TORCH_SUPPORTED_VERSION_ROCM_6X "2.1.1")

3 changes: 2 additions & 1 deletion CONTRIBUTING.md
@@ -21,7 +21,6 @@ Express your support on Twitter if vLLM aids you, or simply offer your appreciat
### Build from source

```bash
pip install -r requirements.txt
pip install -e . # This may take several minutes.
```

@@ -30,6 +29,8 @@ pip install -e . # This may take several minutes.
```bash
pip install -r requirements-dev.txt

# linting and formatting
bash format.sh
# Static type checking
mypy
# Unit tests
84 changes: 66 additions & 18 deletions Dockerfile
@@ -2,6 +2,7 @@
# to run the OpenAI compatible server.

#################### BASE BUILD IMAGE ####################
# prepare basic build environment
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS dev

RUN apt-get update -y \
@@ -16,18 +17,26 @@ RUN ldconfig /usr/local/cuda-12.1/compat/
WORKDIR /workspace

# install build and runtime dependencies
COPY requirements.txt requirements.txt
COPY requirements-common.txt requirements-common.txt
COPY requirements-cuda.txt requirements-cuda.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt
pip install -r requirements-cuda.txt

# install development dependencies
COPY requirements-dev.txt requirements-dev.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements-dev.txt

# cuda arch list used by torch
# can be useful for both `dev` and `test`
# explicitly set the list to avoid issues with torch 2.2
# see https://github.com/pytorch/pytorch/pull/123243
ARG torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0+PTX'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
#################### BASE BUILD IMAGE ####################


#################### EXTENSION BUILD IMAGE ####################
#################### WHEEL BUILD IMAGE ####################
FROM dev AS build

# install build dependencies
@@ -38,18 +47,16 @@ RUN --mount=type=cache,target=/root/.cache/pip \
# install compiler cache to speed up compilation leveraging local or remote caching
RUN apt-get update -y && apt-get install -y ccache

# copy input files
# files and directories related to build wheels
COPY csrc csrc
COPY setup.py setup.py
COPY cmake cmake
COPY CMakeLists.txt CMakeLists.txt
COPY requirements.txt requirements.txt
COPY requirements-common.txt requirements-common.txt
COPY requirements-cuda.txt requirements-cuda.txt
Review thread on this line:
Member: do we need "requirements-dev.txt" or is this file being deprecated?
Author: Not sure. I'll check to see if it looks like we need anything in there.

COPY pyproject.toml pyproject.toml
COPY vllm/__init__.py vllm/__init__.py
COPY vllm vllm

# cuda arch list used by torch
ARG torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0+PTX'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
# max jobs used by Ninja to build extensions
ARG max_jobs=2
ENV MAX_JOBS=${max_jobs}
@@ -61,7 +68,15 @@ ENV VLLM_INSTALL_PUNICA_KERNELS=1

ENV CCACHE_DIR=/root/.cache/ccache
RUN --mount=type=cache,target=/root/.cache/ccache \
python3 setup.py build_ext --inplace
--mount=type=cache,target=/root/.cache/pip \
python3 setup.py bdist_wheel --dist-dir=dist

# the `vllm_nccl` package must be installed from source distribution
# pip is too smart to store a wheel in the cache, and other CI jobs
# will directly use the wheel from the cache, which is not what we want.
# we need to remove it manually
RUN --mount=type=cache,target=/root/.cache/pip \
pip cache remove vllm_nccl*
#################### EXTENSION Build IMAGE ####################

#################### FLASH_ATTENTION Build IMAGE ####################
@@ -79,17 +94,36 @@ WORKDIR /usr/src/flash-attention-v2
RUN pip --verbose wheel flash-attn==${FLASH_ATTN_VERSION} \
--no-build-isolation --no-deps --no-cache-dir

#################### FLASH_ATTENTION Build IMAGE ####################
#################### vLLM installation IMAGE ####################
# image with vLLM installed
FROM nvidia/cuda:12.1.0-base-ubuntu22.04 AS vllm-base
WORKDIR /vllm-workspace

RUN apt-get update -y \
&& apt-get install -y python3-pip git vim

# Workaround for https://github.com/openai/triton/issues/2507 and
# https://github.com/pytorch/pytorch/issues/107960 -- hopefully
# this won't be needed for future versions of this docker image
# or future versions of triton.
RUN ldconfig /usr/local/cuda-12.1/compat/

# install vllm wheel first, so that torch etc will be installed
RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
--mount=type=cache,target=/root/.cache/pip \
pip install dist/*.whl --verbose

RUN --mount=type=bind,from=flash-attn-builder,src=/usr/src/flash-attention-v2,target=/usr/src/flash-attention-v2 \
--mount=type=cache,target=/root/.cache/pip \
pip install /usr/src/flash-attention-v2/*.whl --no-cache-dir

#################### TEST IMAGE ####################
# image to run unit testing suite
FROM dev AS test
# note that this uses vllm installed by `pip`
FROM vllm-base AS test

# copy pytorch extensions separately to avoid having to rebuild
# when python code changes
WORKDIR /vllm-workspace
# ADD is used to preserve directory structure
ADD . /vllm-workspace/
<<<<<<< HEAD
COPY --from=build /workspace/vllm/*.so /vllm-workspace/vllm/
# Install flash attention (from pre-built wheel)
RUN --mount=type=bind,from=flash-attn-builder,src=/usr/src/flash-attention-v2,target=/usr/src/flash-attention-v2 \
@@ -124,17 +158,31 @@ RUN --mount=type=cache,target=/root/.cache/pip \
pip install nm-magic-wand

#################### RUNTIME BASE IMAGE ####################
=======

# install development dependencies (for testing)
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements-dev.txt
>>>>>>> upstream/main

# doc requires source code
# we hide them inside `test_docs/` , so that this source code
# will not be imported by other tests
RUN mkdir test_docs
RUN mv docs test_docs/
RUN mv vllm test_docs/

#################### TEST IMAGE ####################
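A side note on the `test_docs/` shuffle above: it works because a `vllm/` source tree sitting in the working directory would shadow the wheel installed into site-packages, since Python puts the script directory (or the current directory, for `-c`/interactive use) at the front of `sys.path`. A minimal illustrative check, assuming a vLLM wheel is already installed (this snippet is not part of the diff):

```python
# Run this once from an empty directory and once from a checkout that still
# contains a vllm/ source tree to see which copy gets imported.
import sys
import vllm

print(sys.path[0])    # script directory, or '' (the CWD) for `python -c` / interactive use
print(vllm.__file__)  # a site-packages path for the wheel, or ./vllm/... when shadowed
```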

#################### OPENAI API SERVER ####################
# openai api server alternative
FROM vllm-base AS vllm-openai

# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer modelscope

COPY --from=build /workspace/vllm/*.so /workspace/vllm/
COPY vllm vllm
ENV VLLM_USAGE_SOURCE production-docker-image

ENV VLLM_USAGE_SOURCE production-docker-image

3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -1,5 +1,6 @@
include LICENSE
include requirements.txt
include requirements-common.txt
include requirements-cuda.txt
include CMakeLists.txt

recursive-include licenses *
2 changes: 0 additions & 2 deletions README.md
@@ -6,8 +6,6 @@

## Installation

The [nm-vllm PyPi package](https://pypi.org/project/nm-vllm/) includes pre-compiled binaries for CUDA (version 12.1) kernels, streamlining the setup process. For other PyTorch or CUDA versions, please compile the package from source.

Install it using pip:
```bash
pip install nm-vllm
32 changes: 28 additions & 4 deletions benchmarks/benchmark_latency.py
@@ -24,6 +24,7 @@ def main(args: argparse.Namespace):
dtype=args.dtype,
enforce_eager=args.enforce_eager,
kv_cache_dtype=args.kv_cache_dtype,
quantization_param_path=args.quantization_param_path,
device=args.device,
ray_workers_use_nsight=args.ray_workers_use_nsight,
enable_chunked_prefill=args.enable_chunked_prefill,
@@ -67,7 +68,8 @@ def run_to_completion(profile_dir: Optional[str] = None):
return latency

print("Warming up...")
run_to_completion(profile_dir=None)
for _ in tqdm(range(args.num_iters_warmup), desc="Warmup iterations"):
run_to_completion(profile_dir=None)

if args.profile:
profile_dir = args.profile_result_dir
@@ -83,7 +85,12 @@ def run_to_completion(profile_dir: Optional[str] = None):
latencies = []
for _ in tqdm(range(args.num_iters), desc="Profiling iterations"):
latencies.append(run_to_completion(profile_dir=None))
latencies = np.array(latencies)
percentages = [10, 25, 50, 75, 90]
percentiles = np.percentile(latencies, percentages)
print(f'Avg latency: {np.mean(latencies)} seconds')
for percentage, percentile in zip(percentages, percentiles):
print(f'{percentage}% percentile latency: {percentile} seconds')


if __name__ == '__main__':
@@ -105,9 +112,13 @@ def run_to_completion(profile_dir: Optional[str] = None):
default=1,
help='Number of generated sequences per prompt.')
parser.add_argument('--use-beam-search', action='store_true')
parser.add_argument('--num-iters-warmup',
type=int,
default=10,
help='Number of iterations to run for warmup.')
parser.add_argument('--num-iters',
type=int,
default=3,
default=30,
help='Number of iterations to run.')
parser.add_argument('--trust-remote-code',
action='store_true',
@@ -127,10 +138,23 @@ def run_to_completion(profile_dir: Optional[str] = None):
parser.add_argument(
"--kv-cache-dtype",
type=str,
choices=['auto', 'fp8_e5m2'],
choices=['auto', 'fp8'],
default='auto',
help=
'Data type for kv cache storage. If "auto", will use model data type.')
'Data type for kv cache storage. If "auto", will use model data type. '
'FP8_E5M2 (without scaling) is only supported on cuda version greater '
'than 11.8. On ROCm (AMD GPU), FP8_E4M3 is instead supported for '
'common inference criteria.')
parser.add_argument(
'--quantization-param-path',
type=str,
default=None,
help='Path to the JSON file containing the KV cache scaling factors. '
'This should generally be supplied, when KV cache dtype is FP8. '
'Otherwise, KV cache scaling factors default to 1.0, which may cause '
'accuracy issues. FP8_E5M2 (without scaling) is only supported on '
'cuda version greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is '
'instead supported for common inference criteria.')
parser.add_argument(
'--profile',
action='store_true',
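For context on the reporting change in this file: the new summary block is a plain NumPy mean/percentile computation over the per-iteration latencies, and with the defaults now at 10 warmup plus 30 measured iterations the percentiles are far more stable than the old 3-iteration average. A self-contained sketch with made-up latency values (not taken from any real run):

```python
import numpy as np

# Hypothetical per-iteration latencies in seconds, standing in for the values
# collected from run_to_completion() in the measurement loop.
latencies = np.array([0.91, 0.93, 0.95, 1.02, 1.10, 1.25])

percentages = [10, 25, 50, 75, 90]
percentiles = np.percentile(latencies, percentages)  # interpolates between samples

print(f'Avg latency: {np.mean(latencies)} seconds')
for percentage, percentile in zip(percentages, percentiles):
    print(f'{percentage}% percentile latency: {percentile} seconds')
```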
23 changes: 13 additions & 10 deletions benchmarks/benchmark_serving.py
@@ -112,7 +112,9 @@ def sample_sonnet_requests(
prefix_len: int,
tokenizer: PreTrainedTokenizerBase,
) -> List[Tuple[str, str, int, int]]:
assert input_len > prefix_len, "input_len must be greater than prefix_len."
assert (
input_len > prefix_len
), "'args.sonnet-input-len' must be greater than 'args.prefix-input-len'."

# Load the dataset.
with open(dataset_path) as f:
@@ -133,16 +135,17 @@ def sample_sonnet_requests(
base_message, add_generation_prompt=True, tokenize=False)
base_prompt_offset = len(tokenizer(base_prompt_formatted).input_ids)

assert (input_len > base_prompt_offset
), f"Please set 'args.input-len' higher than {base_prompt_offset}."
assert (
input_len > base_prompt_offset
), f"Please set 'args.sonnet-input-len' higher than {base_prompt_offset}."
num_input_lines = round(
(input_len - base_prompt_offset) / average_poem_len)

# First approximately `prefix_len` number of tokens in the
# prompt are fixed poem lines.
assert (
prefix_len > base_prompt_offset
), f"Please set 'args.prefix-len' higher than {base_prompt_offset}."
), f"Please set 'args.sonnet-prefix-len' higher than {base_prompt_offset}."

num_prefix_lines = round(
(prefix_len - base_prompt_offset) / average_poem_len)
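To make the guarded arithmetic above concrete, here is a rough, self-contained sketch of the line-count computation with made-up numbers; in the real script, base_prompt_offset and average_poem_len come from the tokenizer and the sonnet dataset rather than being fixed constants:

```python
# Hypothetical values, for illustration only.
input_len = 550          # requested prompt length in tokens (args.sonnet_input_len)
prefix_len = 200         # requested shared-prefix length in tokens (args.sonnet_prefix_len)
base_prompt_offset = 30  # tokens the chat template adds around the poem lines
average_poem_len = 10.0  # average tokens per poem line in the dataset

# The assertions in the diff guard exactly these inequalities.
assert input_len > prefix_len
assert input_len > base_prompt_offset
assert prefix_len > base_prompt_offset

num_input_lines = round((input_len - base_prompt_offset) / average_poem_len)    # 52
num_prefix_lines = round((prefix_len - base_prompt_offset) / average_poem_len)  # 17
```

The reworded assertion messages simply point users at the sonnet-specific flag names that main() now reads, as shown in the hunks below.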
@@ -375,9 +378,9 @@ def main(args: argparse.Namespace):
input_requests = sample_sonnet_requests(
dataset_path=args.dataset_path,
num_requests=args.num_prompts,
input_len=args.input_len,
output_len=args.output_len,
prefix_len=args.prefix_len,
input_len=args.sonnet_input_len,
output_len=args.sonnet_output_len,
prefix_len=args.sonnet_prefix_len,
tokenizer=tokenizer,
)
input_requests = [(prompt, prompt_len, output_len)
@@ -390,9 +393,9 @@ def main(args: argparse.Namespace):
input_requests = sample_sonnet_requests(
dataset_path=args.dataset_path,
num_requests=args.num_prompts,
input_len=args.input_len,
output_len=args.output_len,
prefix_len=args.prefix_len,
input_len=args.sonnet_input_len,
output_len=args.sonnet_output_len,
prefix_len=args.sonnet_prefix_len,
tokenizer=tokenizer,
)
input_requests = [(prompt_formatted, prompt_len, output_len)