Upstream merge 25 02 17 #430

Merged
112 commits, merged Feb 17, 2025
Commits
deb6c1c
[Doc] Improve OpenVINO installation doc (#13102)
hmellor Feb 11, 2025
14ecab5
[Bugfix] Guided decoding falls back to outlines when fails to import …
terrytangyuan Feb 11, 2025
72c2b68
[Misc] Move pre-commit suggestion back to the end (#13114)
russellb Feb 11, 2025
3ee696a
[RFC][vllm-API] Support tokenizer registry for customized tokenizer i…
youngkent Feb 12, 2025
974dfd4
[Model] IBM/NASA Prithvi Geospatial model (#12830)
christian-pinto Feb 12, 2025
842b0fd
[ci] Add more source file dependencies for some tests (#13123)
khluu Feb 12, 2025
e92694b
[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAtt…
lingfanyu Feb 12, 2025
a0597c6
Bump helm/kind-action from 1.10.0 to 1.12.0 (#11612)
dependabot[bot] Feb 12, 2025
dd3b4a0
Bump actions/stale from 9.0.0 to 9.1.0 (#12462)
dependabot[bot] Feb 12, 2025
0c7d9ef
Bump helm/chart-testing-action from 2.6.1 to 2.7.0 (#12463)
dependabot[bot] Feb 12, 2025
d59def4
Bump actions/setup-python from 5.3.0 to 5.4.0 (#12672)
dependabot[bot] Feb 12, 2025
7c4033a
Further reduce the HTTP calls to huggingface.co (#13107)
maxdebayser Feb 12, 2025
f1042e8
[Misc] AMD Build Improvements (#12923)
842974287 Feb 12, 2025
f4d97e4
[Bug] [V1] Try fetching stop_reason from EngineOutput before checking…
bnellnm Feb 12, 2025
985b4a2
[Bugfix] Fix num video tokens calculation for Qwen2-VL (#13148)
DarkLight1337 Feb 12, 2025
314cfad
[Frontend] Generate valid tool call IDs when using `tokenizer-mode=mi…
rafvasq Feb 12, 2025
82cabf5
[Misc] Delete unused LoRA modules (#13151)
jeejeelee Feb 12, 2025
042c341
Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path (#1…
houseroad Feb 12, 2025
2c2b560
[CI/Build] Use mypy matcher for pre-commit CI job (#13162)
russellb Feb 12, 2025
36a0863
[CORE] [QUANT] Support for GPTQModel's `dynamic` quantization per mod…
Qubitium Feb 12, 2025
09972e7
[Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularit…
mgoin Feb 12, 2025
14b7899
[CI] Fix failing FP8 cpu offload test (#13170)
mgoin Feb 12, 2025
4c0d93f
[V1][Bugfix] Copy encoder input ids to fix set iteration issue during…
andoorve Feb 12, 2025
8eafe5e
[CI/Build] Ignore ruff warning up007 (#13182)
russellb Feb 13, 2025
9f9704d
[perf-benchmark] cleanup unused Docker images and volumes in H100 ben…
khluu Feb 13, 2025
4fc5c23
[NVIDIA] Support nvfp4 quantization (#12784)
kaixih Feb 13, 2025
d88c866
[Bugfix][Example] Fix GCed profiling server for TPU (#12792)
mgoin Feb 13, 2025
bc55d13
[VLM] Implement merged multimodal processor for Mllama (#11427)
Isotr0py Feb 13, 2025
009439c
Simplify logic of locating CUDART so file path (#13203)
houseroad Feb 13, 2025
60c68df
[Build] Automatically use the wheel of the base commit with Python-on…
comaniac Feb 13, 2025
04f50ad
[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong fie…
LikeSundayLikeRain Feb 13, 2025
d46d490
[Frontend] Move CLI code into vllm.cmd package (#12971)
russellb Feb 13, 2025
cb944d5
Allow Unsloth Dynamic 4bit BnB quants to work (#12974)
danielhanchen Feb 13, 2025
0ccd876
[CI/Build] Allow ruff to auto-fix some issues (#13180)
russellb Feb 13, 2025
9605c12
[V1][core] Implement pipeline parallel on Ray (#12996)
ruisearch42 Feb 13, 2025
fa253f1
[VLM] Remove input processor from clip and siglip (#13165)
Isotr0py Feb 13, 2025
578087e
[Frontend] Pass pre-created socket to uvicorn (#13113)
russellb Feb 13, 2025
fdcf64d
[V1] Clarify input processing and multimodal feature caching logic (#…
ywang96 Feb 13, 2025
c9d3ecf
[VLM] Merged multi-modal processor for Molmo (#12966)
DarkLight1337 Feb 13, 2025
2092a6f
[V1][Core] Add worker_base for v1 worker (#12816)
AoyuQC Feb 13, 2025
02ed8a1
[Misc] Qwen2.5-VL Optimization (#13155)
wulipc Feb 13, 2025
1bc3b5e
[VLM] Separate text-only and vision variants of the same model archit…
DarkLight1337 Feb 13, 2025
37dfa60
[Bugfix] Missing Content Type returns 500 Internal Server Error (#13193)
vaibhavjainwiz Feb 13, 2025
d84cef7
[Frontend] Add `/v1/audio/transcriptions` OpenAI API endpoint (#12909)
NickLucche Feb 13, 2025
bffddd9
Add label if pre-commit passes (#12527)
hmellor Feb 13, 2025
2344192
Optimize moe_align_block_size for deepseek_v3 (#12850)
mgoin Feb 13, 2025
c1e37bf
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198)
tlrmchlsmth Feb 14, 2025
e38be64
Revert "Add label if pre-commit passes" (#13242)
hmellor Feb 14, 2025
4108869
[ROCm] Avoid using the default stream on ROCm (#13238)
gshtras Feb 14, 2025
8c32b08
[Kernel] Fix awq error when n is not divisible by 128 (#13227)
jinzhen-lin Feb 14, 2025
dd5ede4
[V1] Consolidate MM cache size to vllm.envs (#13239)
ywang96 Feb 14, 2025
09545c0
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (#13250)
tlrmchlsmth Feb 14, 2025
0676782
[Bugfix][CI] Inherit codespell settings from pyproject.toml in the pr…
tlrmchlsmth Feb 14, 2025
84683fa
[Bugfix] Offline example of disaggregated prefill (#13214)
XiaobingSuper Feb 14, 2025
40932d7
[Misc] Remove redundant statements in scheduler.py (#13229)
WrRan Feb 14, 2025
f2b20fe
Consolidate Llama model usage in tests (#13094)
hmellor Feb 14, 2025
f0b2da7
Expand MLA to support most types of quantization (#13181)
mgoin Feb 14, 2025
cbc4012
[V1] LoRA - Enable Serving Usecase (#12883)
varun-sundar-rabindranath Feb 14, 2025
ba59b78
[ROCm][V1] Add initial ROCm support to V1 (#12790)
SageMoore Feb 14, 2025
b0ccfc5
[Bugfix][V1] GPUModelRunner._update_states should return True when th…
imkero Feb 14, 2025
45f90bc
[WIP] TPU V1 Support Refactored (#13049)
alexm-redhat Feb 14, 2025
185cc19
[Frontend] Optionally remove memory buffer used for uploading to URLs…
pooyadavoodi Feb 14, 2025
83481ce
[Bugfix] Fix missing parentheses (#13263)
xu-song Feb 14, 2025
556ef7f
[Misc] Log time consumption of sleep and wake-up (#13115)
waltforme Feb 14, 2025
4da1f66
[VLM] Keep track of whether prompt replacements have been applied (#1…
DarkLight1337 Feb 14, 2025
085b7b2
[V1] Simplify GPUModelRunner._update_states check (#13265)
njhill Feb 14, 2025
6224a9f
Support logit_bias in v1 Sampler (#13079)
houseroad Feb 14, 2025
7734e9a
[Core] choice-based structured output with xgrammar (#12632)
russellb Feb 14, 2025
c9e2d64
[Hardware][Gaudi][Bugfix] Fix error for guided decoding (#12317)
zhouyu5 Feb 14, 2025
5e5c8e0
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many expe…
mgoin Feb 14, 2025
3bcb8c7
[Core] Reduce TTFT with concurrent partial prefills (#10235)
joerunde Feb 14, 2025
a12934d
[V1][Core] min_p sampling support (#13191)
AoyuQC Feb 14, 2025
e7eea5a
[V1][CI] Fix failed v1-test because of min_p (#13316)
WoosukKwon Feb 15, 2025
6a854c7
[V1][Sampler] Don't apply temp for greedy-only (#13311)
njhill Feb 15, 2025
0c73026
[V1][PP] Fix memory profiling in PP (#13315)
WoosukKwon Feb 15, 2025
c9f9d5b
[Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't bu…
SageMoore Feb 15, 2025
579d7a6
[Bugfix][Docs] Fix offline Whisper (#13274)
NickLucche Feb 15, 2025
97a3d6d
[Bugfix] Massage MLA's usage of flash attn for RoCM (#13310)
tlrmchlsmth Feb 15, 2025
9076325
[BugFix] Don't scan entire cache dir when loading model (#13302)
njhill Feb 15, 2025
067fa22
[Bugfix]Fix search start_index of stop_checker (#13280)
xu-song Feb 15, 2025
7fdaaf4
[Bugfix] Fix qwen2.5-vl image processor (#13286)
Isotr0py Feb 15, 2025
2ad1bc7
[V1][Metrics] Add iteration_tokens_total histogram from V0 (#13288)
markmc Feb 15, 2025
ed0de3e
[AMD] [Model] DeepSeek tunings (#13199)
rasmith Feb 15, 2025
9206b3d
[V1][PP] Run engine busy loop with batch queue (#13064)
comaniac Feb 15, 2025
54ed913
[ci/build] update flashinfer (#13323)
youkaichao Feb 15, 2025
367cb8c
[Doc] [2/N] Add Fuyu E2E example for multimodal processor (#13331)
DarkLight1337 Feb 15, 2025
80f63a3
[V1][Spec Decode] Ngram Spec Decode (#12193)
LiuXiaoxuanPKU Feb 16, 2025
12913d1
[Quant] Add `SupportsQuant` to phi3 and clip (#13104)
kylesayrs Feb 16, 2025
d3d547e
[Bugfix] Pin xgrammar to 0.1.11 (#13338)
mgoin Feb 16, 2025
ccaff7f
avoid calling hf_list_repo_files for local model
Isotr0py Feb 16, 2025
7cc05dd
annotation
Isotr0py Feb 16, 2025
dc0f7cc
[BugFix] Enhance test_pos_encoding to support execution on multi-devi…
wchen61 Feb 16, 2025
b7d3098
[V1] Update doc and examples for H2O-VL (#13349)
ywang96 Feb 16, 2025
124776e
[ci] skip failed tests for flashinfer (#13352)
youkaichao Feb 16, 2025
a0231b7
[platform] add base class for communicators (#13208)
youkaichao Feb 16, 2025
5d2965b
[Bugfix] Fix 2 Node and Spec Decode tests (#13341)
DarkLight1337 Feb 16, 2025
da833b0
[Docs] Change myenv to vllm. Update python_env_setup.inc.md (#13325)
arkylin Feb 16, 2025
7b89386
[V1][BugFix] Add __init__.py to v1/spec_decode/ (#13359)
WoosukKwon Feb 16, 2025
e18227b
[V1][PP] Cache Intermediate Tensors (#13353)
WoosukKwon Feb 16, 2025
d67cc21
[Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend ed…
Isotr0py Feb 16, 2025
69e1d23
[V1][BugFix] Clean up rejection sampler & Fix warning msg (#13362)
WoosukKwon Feb 16, 2025
2010f04
[V1][Misc] Avoid unnecessary log output (#13289)
jeejeelee Feb 17, 2025
46cdd59
[Feature][Spec Decode] Simplify the use of Eagle Spec Decode (#12304)
ShangmingCai Feb 17, 2025
f857311
Fix spelling error in index.md (#13369)
yankooo Feb 17, 2025
4518683
Run v1 benchmark and integrate with PyTorch OSS benchmark database (#…
huydhn Feb 17, 2025
238dfc8
[MISC] tiny fixes (#13378)
MengqingCao Feb 17, 2025
7b623fc
[VLM] Check required fields before initializing field config in `Dict…
DarkLight1337 Feb 17, 2025
1f69c4a
[Model] Support Mamba2 (Codestral Mamba) (#9292)
tlrmchlsmth Feb 17, 2025
30513d1
[Bugfix] fix xpu communicator (#13368)
yma11 Feb 17, 2025
ce77eb9
[Bugfix] Fix VLLM_USE_MODELSCOPE issue (#13384)
r4ntix Feb 17, 2025
ce342c7
Merge remote-tracking branch 'upstream/main' into upstream_merge_25_0…
gshtras Feb 17, 2025
669fc3f
Merge remote-tracking branch 'Isotr0py/local-lookup' into upstream_me…
gshtras Feb 17, 2025
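The final two merge commits show how this sync was assembled: upstream/main is merged into the fork's integration branch, then the 'Isotr0py/local-lookup' remote-tracking branch is merged on top. A rough sketch of the equivalent commands follows; the integration-branch name is truncated in the commit subjects above, so the one used here is an assumption, as are the remote names.

git fetch upstream
git fetch Isotr0py
git checkout upstream_merge_25_02_17   # assumed branch name; truncated in the commit subjects above
git merge upstream/main                # first merge commit
git merge Isotr0py/local-lookup        # second merge commit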
Files changed
6 changes: 6 additions & 0 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -70,6 +70,12 @@ steps:
#key: block-h100
#depends_on: ~

- label: "Cleanup H100"
agents:
queue: H100
depends_on: ~
command: docker system prune -a --volumes --force

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
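The new step above reclaims disk space on the H100 queue before the benchmark job runs. A minimal local equivalent, shown only as a sketch, assuming nothing cached on that host needs to be preserved (the prune is destructive and non-interactive):

docker system df                           # optional: inspect current image/volume usage first
docker system prune -a --volumes --force   # remove all unused images, containers, networks and volumes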
@@ -345,6 +345,11 @@ main() {
check_gpus
check_hf_token

# Set to v1 to run v1 benchmark
if [[ "${ENGINE_VERSION:-v0}" == "v1" ]]; then
export VLLM_USE_V1=1
fi

# dependencies
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get update && apt-get -y install jq)
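The hunk above adds an opt-in switch for the V1 engine to the benchmark driver: when ENGINE_VERSION is set to v1, the script exports VLLM_USE_V1=1 before running, and v0 remains the default otherwise. A minimal sketch of opting in; the script's path is not captured in this diff, so the name below is only a placeholder.

ENGINE_VERSION=v1 bash <nightly-benchmark-script>.sh   # placeholder; substitute the real benchmark entrypoint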
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/tests/latency-tests.json
@@ -29,4 +29,4 @@
"num-iters": 15
}
}
]
]
24 changes: 20 additions & 4 deletions .buildkite/test-pipeline.yaml
@@ -107,13 +107,17 @@ steps:
mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/entrypoints/llm
- tests/entrypoints/openai
- tests/entrypoints/test_chat_utils
- tests/entrypoints/offline_mode
commands:
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
- pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_oot_registration.py
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/correctness/
- pytest -v -s entrypoints/test_chat_utils.py
- pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests

@@ -124,9 +128,10 @@ steps:
source_file_dependencies:
- vllm/distributed/
- vllm/core/
- tests/distributed
- tests/distributed/test_utils
- tests/distributed/test_pynccl
- tests/spec_decode/e2e/test_integration_dist_tp4
- tests/compile
- tests/compile/test_basic_correctness
- examples/offline_inference/rlhf.py
- examples/offline_inference/rlhf_colocate.py
commands:
@@ -174,6 +179,9 @@ steps:
- vllm/
- tests/engine
- tests/tokenization
- tests/test_sequence
- tests/test_config
- tests/test_logger
commands:
- pytest -v -s engine test_sequence.py test_config.py test_logger.py
# OOM in the CI unless we run this separately
@@ -197,7 +205,7 @@ steps:
- VLLM_USE_V1=1 pytest -v -s v1/e2e
# Integration test for streaming correctness (requires special branch).
- pip install -U git+https://github.com/robertgshaw2-neuralmagic/lm-evaluation-harness.git@streaming-api
- pytest -v -s entrypoints/openai/test_accuracy.py::test_lm_eval_accuracy_v1_engine
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine

- label: Examples Test # 25min
working_dir: "/vllm-workspace/examples"
@@ -331,6 +339,14 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- bash ./run-tests.sh -c configs/models-small.txt -t 1

- label: OpenAI API correctness
source_file_dependencies:
- csrc/
- vllm/entrypoints/openai/
- vllm/model_executor/models/whisper.py
commands: # LMEval+Transcription WER check
- pytest -s entrypoints/openai/correctness/

- label: Encoder Decoder tests # 5min
source_file_dependencies:
- vllm/
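One of the additions above is a dedicated "OpenAI API correctness" step that runs the relocated LM-Eval and transcription WER checks. A hedged sketch of an equivalent local invocation, assuming a source checkout where the suite lives under tests/ (the CI job runs from its own working directory):

pytest -s tests/entrypoints/openai/correctness/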
2 changes: 1 addition & 1 deletion .github/workflows/cleanup_pr_body.yml
@@ -16,7 +16,7 @@ jobs:
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

- name: Set up Python
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
with:
python-version: '3.12'

3 changes: 2 additions & 1 deletion .github/workflows/pre-commit.yml
@@ -10,10 +10,11 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
- uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
with:
python-version: "3.12"
- run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
- run: echo "::add-matcher::.github/workflows/matchers/mypy.json"
- uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
with:
extra_args: --all-files --hook-stage manual
2 changes: 1 addition & 1 deletion .github/workflows/stale.yml
@@ -13,7 +13,7 @@ jobs:
actions: write
runs-on: ubuntu-latest
steps:
- uses: actions/stale@28ca1036281a5e5922ead5184a1bbf96e5fc984e # v9.0.0
- uses: actions/stale@5bef64f19d7facfb25b37b414482c7164d639639 # v9.1.0
with:
# Increasing this value ensures that changes to this workflow
# propagate to all issues and PRs in days rather than months
21 changes: 12 additions & 9 deletions .pre-commit-config.yaml
@@ -13,13 +13,14 @@ repos:
rev: v0.9.3
hooks:
- id: ruff
args: [--output-format, github]
args: [--output-format, github, --fix]
exclude: 'vllm/third_party/.*'
- repo: https://github.com/codespell-project/codespell
rev: v2.4.0
hooks:
- id: codespell
exclude: 'benchmarks/sonnet.txt|(build|tests/(lora/data|models/fixtures|prompts))/.*|csrc/rocm/.*|csrc/gradlib/.*|vllm/third_party/.*'
additional_dependencies: ['tomli']
args: ['--toml', 'pyproject.toml']
- repo: https://github.com/PyCQA/isort
rev: 5.13.2
hooks:
@@ -116,13 +117,6 @@ repos:
language: python
types: [python]
exclude: 'vllm/third_party/.*'
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false
exclude: 'vllm/third_party/.*'
- id: check-filenames
name: Check for spaces in all filenames
entry: bash
@@ -133,3 +127,12 @@
always_run: true
pass_filenames: false
exclude: 'vllm/third_party/.*'
# Keep `suggestion` last
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false
exclude: 'vllm/third_party/.*'
# Insert new entries above the `suggestion` entry
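With the codespell change above, the hook stops carrying its exclude list in this file and instead reads its configuration from pyproject.toml via --toml (hence the added tomli dependency). A rough sketch of running the same check outside pre-commit, assuming codespell is installed locally; the exact contents of vLLM's [tool.codespell] table are not shown in this diff.

pip install codespell tomli       # tomli is only required on Python < 3.11
codespell --toml pyproject.toml   # applies the settings from the [tool.codespell] table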
28 changes: 23 additions & 5 deletions CMakeLists.txt
@@ -262,7 +262,8 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
SET(CUTLASS_ENABLE_HEADERS_ONLY ON CACHE BOOL "Enable only the header library")

# Set CUTLASS_REVISION manually -- its revision detection doesn't work in this case.
set(CUTLASS_REVISION "v3.6.0" CACHE STRING "CUTLASS revision to use")
# Please keep this in sync with FetchContent_Declare line below.
set(CUTLASS_REVISION "v3.7.0" CACHE STRING "CUTLASS revision to use")

# Use the specified CUTLASS source directory for compilation if VLLM_CUTLASS_SRC_DIR is provided
if (DEFINED ENV{VLLM_CUTLASS_SRC_DIR})
@@ -279,6 +280,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
FetchContent_Declare(
cutlass
GIT_REPOSITORY https://github.com/nvidia/cutlass.git
# Please keep this in sync with CUTLASS_REVISION line above.
GIT_TAG v3.7.0
GIT_PROGRESS TRUE

@@ -298,8 +300,8 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"csrc/custom_all_reduce.cu"
"csrc/permute_cols.cu"
"csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu"
"csrc/quantization/fp4/nvfp4_quant_entry.cu"
"csrc/sparse/cutlass/sparse_scaled_mm_entry.cu"
"csrc/sparse/cutlass/sparse_compressor_entry.cu"
"csrc/cutlass_extensions/common.cpp")

set_gencode_flags_for_srcs(
@@ -392,8 +394,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
# require CUDA 12.2 or later (and only work on Hopper, 9.0a for now).
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
set(SRCS "csrc/sparse/cutlass/sparse_compressor_c3x.cu"
"csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
set(SRCS "csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
@@ -411,6 +412,23 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()
endif()

# FP4 Archs and flags
cuda_archs_loose_intersection(FP4_ARCHS "10.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND FP4_ARCHS)
set(SRCS
"csrc/quantization/fp4/nvfp4_quant_kernels.cu"
)
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${FP4_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_NVFP4=1")
message(STATUS "Building NVFP4 for archs: ${FP4_ARCHS}")
else()
message(STATUS "Not building NVFP4 as no compatible archs were found.")
# clear FP4_ARCHS
set(FP4_ARCHS)
endif()

#
# Machete kernels
@@ -497,7 +515,7 @@ define_gpu_extension_target(
SOURCES ${VLLM_EXT_SRC}
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR};${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR}
USE_SABI 3
WITH_SOABI)

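Per the new FP4 block above, the nvfp4 quantization kernels are compiled only when the CUDA compiler is newer than 12.8 and the requested arch list intersects 10.0a; otherwise the build prints the "Not building NVFP4" notice and continues. A sketch of constraining a source build accordingly; TORCH_CUDA_ARCH_LIST is the usual override for vLLM's arch selection, but treat the exact knob as an assumption and check the build docs.

export TORCH_CUDA_ARCH_LIST="10.0a"      # Blackwell-class arch required for NVFP4
pip install -e . --no-build-isolation    # assumes a CUDA 12.8+ toolkit and build deps are installed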
7 changes: 5 additions & 2 deletions Dockerfile
@@ -195,19 +195,22 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
--mount=type=cache,target=/root/.cache/pip \
python3 -m pip install dist/*.whl --verbose

# How to build this FlashInfer wheel:
# If we need to build FlashInfer wheel before its release:
# $ export FLASHINFER_ENABLE_AOT=1
# $ # Note we remove 7.0 from the arch list compared to the list below, since FlashInfer only supports sm75+
# $ export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX'
# $ git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
# $ cd flashinfer
# $ git checkout 524304395bd1d8cd7d07db083859523fcaa246a4
# $ rm -rf build
# $ python3 setup.py bdist_wheel --dist-dir=dist --verbose
# $ ls dist
# $ # upload the wheel to a public location, e.g. https://wheels.vllm.ai/flashinfer/524304395bd1d8cd7d07db083859523fcaa246a4/flashinfer_python-0.2.1.post1+cu124torch2.5-cp38-abi3-linux_x86_64.whl

RUN --mount=type=cache,target=/root/.cache/pip \
. /etc/environment && \
if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
python3 -m pip install https://wheels.vllm.ai/flashinfer/524304395bd1d8cd7d07db083859523fcaa246a4/flashinfer_python-0.2.0.post1-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl; \
python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.1.post1/flashinfer_python-0.2.1.post1+cu124torch2.5-cp38-abi3-linux_x86_64.whl ; \
fi
COPY examples examples
