add quay source
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (vllm-project#8272)

[Frontend] Add progress reporting to run_batch.py (vllm-project#8060)

Co-authored-by: Adam Lugowski <adam.lugowski@parasail.io>

[Bugfix] Correct adapter usage for cohere and jamba (vllm-project#8292)

[Misc] GPTQ Activation Ordering (vllm-project#8135)

[Misc] Fused MoE Marlin support for GPTQ (vllm-project#8217)

Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (vllm-project#8319)

[Bugfix] Fix missing `post_layernorm` in CLIP (vllm-project#8155)

[CI/Build] enable ccache/scccache for HIP builds (vllm-project#8327)

[Frontend] Clean up type annotations for mistral tokenizer (vllm-project#8314)

[CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail (vllm-project#8130)

Fix ppc64le buildkite job (vllm-project#8309)

[Spec Decode] Move ops.advance_step to flash attn advance_step (vllm-project#8224)

[Misc] remove peft as dependency for prompt models (vllm-project#8162)

[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled (vllm-project#8342)

[Bugfix] lookahead block table with cuda graph max capture (vllm-project#8340)

[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (vllm-project#8340)

[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (vllm-project#8172)

[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (vllm-project#8043)

[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (vllm-project#8329)

[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (vllm-project#8299)

[Hardware][NV] Add support for ModelOpt static scaling checkpoints. (vllm-project#6112)

[model] Support for Llava-Next-Video model (vllm-project#7559)

Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

[Frontend] Create ErrorResponse instead of raising exceptions in run_batch (vllm-project#8347)

[Model][VLM] Add Qwen2-VL model support (vllm-project#7905)

Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (vllm-project#7257)

[CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation (vllm-project#8373)

[Bugfix] Add missing attributes in mistral tokenizer (vllm-project#8364)

[Kernel][Misc] register ops to prevent graph breaks (vllm-project#6917)

Co-authored-by: Sage Moore <sage@neuralmagic.com>

[Misc] Move device options to a single place (vllm-project#8322)

[Speculative Decoding] Test refactor (vllm-project#8317)

Co-authored-by: youkaichao <youkaichao@126.com>

Pixtral (vllm-project#8377)

Co-authored-by: Roger Wang <ywang@roblox.com>

Bump version to v0.6.1 (vllm-project#8379)

[MISC] Dump model runner inputs when crashing (vllm-project#8305)

[misc] remove engine_use_ray (vllm-project#8126)

[TPU] Use Ray for default distributed backend (vllm-project#8389)

Fix the AMD weight loading tests (vllm-project#8390)

[Bugfix]: Fix the logic for deciding if tool parsing is used (vllm-project#8366)

[Gemma2] add bitsandbytes support for Gemma2 (vllm-project#8338)

[Misc] Raise error when using encoder/decoder model with cpu backend (vllm-project#8355)

[Misc] Use RoPE cache for MRoPE (vllm-project#8396)

[torch.compile] hide slicing under custom op for inductor (vllm-project#8384)

[Hotfix][VLM] Fixing max position embeddings for Pixtral (vllm-project#8399)

[Bugfix] Fix InternVL2 inference with various num_patches (vllm-project#8375)

Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Model] Support multiple images for qwen-vl (vllm-project#8247)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance (vllm-project#8403)

[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (vllm-project#8423)

[Bugfix] Offline mode fix (vllm-project#8376)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

[multi-step] add flashinfer backend (vllm-project#7928)

[Core] Add engine option to return only deltas or final output (vllm-project#7381)

[Bugfix] multi-step + flashinfer: ensure cuda graph compatible  (vllm-project#8427)

[Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models (vllm-project#8425)

[CI/Build] Disable multi-node test for InternVL2 (vllm-project#8428)

[Hotfix][Pixtral] Fix multiple images bugs (vllm-project#8415)

[Bugfix] Fix weight loading issue by rename variable. (vllm-project#8293)

[Misc] Update Pixtral example (vllm-project#8431)

[BugFix] fix group_topk (vllm-project#8430)

[Core] Factor out input preprocessing to a separate class (vllm-project#7329)

[Bugfix] Mapping physical device indices for e2e test utils (vllm-project#8290)

[Bugfix] Bump fastapi and pydantic version (vllm-project#8435)

[CI/Build] Update pixtral tests to use JSON (vllm-project#8436)

[Bugfix] Fix async log stats (vllm-project#8417)

[bugfix] torch profiler bug for single gpu with GPUExecutor (vllm-project#8354)

bump version to v0.6.1.post1 (vllm-project#8440)

[CI/Build] Enable InternVL2 PP test only on single node (vllm-project#8437)

[doc] recommend pip instead of conda (vllm-project#8446)

[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (vllm-project#8442)

[misc][ci] fix quant test (vllm-project#8449)

[Installation] Gate FastAPI version for Python 3.8 (vllm-project#8456)

[plugin][torch.compile] allow to add custom compile backend (vllm-project#8445)

[CI/Build] Reorganize models tests (vllm-project#7820)

[Doc] Add oneDNN installation to CPU backend documentation (vllm-project#8467)

[HotFix] Fix final output truncation with stop string + streaming (vllm-project#8468)

bump version to v0.6.1.post2 (vllm-project#8473)

[Hardware][intel GPU] bump up ipex version to 2.3 (vllm-project#8365)

Co-authored-by: Yan Ma <yan.ma@intel.com>

[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (vllm-project#8310)

[Model] support minicpm3 (vllm-project#8297)

Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[torch.compile] fix functionalization (vllm-project#8480)

[torch.compile] add a flag to disable custom op (vllm-project#8488)

[TPU] Implement multi-step scheduling (vllm-project#8489)

[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations (vllm-project#8490)

[Bugfix][Kernel] Add `IQ1_M` quantization implementation to GGUF kernel (vllm-project#8357)

[Kernel] Enable 8-bit weights in Fused Marlin MoE (vllm-project#8032)

Co-authored-by: Dipika <dipikasikka1@gmail.com>

[Frontend] Expose revision arg in OpenAI server (vllm-project#8501)

[BugFix] Fix clean shutdown issues (vllm-project#8492)

[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (vllm-project#8506)

[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (vllm-project#7270)

[doc] update doc on testing and debugging (vllm-project#8514)

[Bugfix] Bind api server port before starting engine (vllm-project#8491)

[perf bench] set timeout to debug hanging (vllm-project#8516)

[misc] small qol fixes for release process (vllm-project#8517)

[Bugfix] Fix 3.12 builds on main (vllm-project#8510)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

[refactor] remove triton based sampler (vllm-project#8524)

[Frontend] Improve Nullable kv Arg Parsing (vllm-project#8525)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

[Misc][Bugfix] Disable guided decoding for mistral tokenizer (vllm-project#8521)

[torch.compile] register allreduce operations as custom ops (vllm-project#8526)

[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (vllm-project#8509)

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

[Benchmark] Support sample from HF datasets and image input for benchmark_serving (vllm-project#8495)

[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (vllm-project#7631)

[Feature][kernel] tensor parallelism with bitsandbytes quantization (vllm-project#8434)

[Model] Add mistral function calling format to all models loaded with "mistral" format (vllm-project#8515)

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

[Misc] Don't dump contents of kvcache tensors on errors (vllm-project#8527)

[Bugfix] Fix TP > 1 for new granite (vllm-project#8544)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

[doc] improve installation doc (vllm-project#8550)

Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>

[CI/Build] Excluding kernels/test_gguf.py from ROCm (vllm-project#8520)

[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (vllm-project#8012)

[CI/Build] fix Dockerfile.cpu on podman (vllm-project#8540)

[Misc] Add argument to disable FastAPI docs (vllm-project#8554)

[CI/Build] Avoid CUDA initialization (vllm-project#8534)

[CI/Build] Update Ruff version (vllm-project#8469)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

[Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (vllm-project#8157)

Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>

[Core] *Prompt* logprobs support in Multi-step (vllm-project#8199)

[Core] zmq: bind only to 127.0.0.1 for local-only usage (vllm-project#8543)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

[Model] Support Solar Model (vllm-project#8386)

Co-authored-by: Michael Goin <michael@neuralmagic.com>

[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (vllm-project#8380)

Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

[Kernel] Change interface to Mamba selective_state_update for continuous batching (vllm-project#8039)

[BugFix] Nonzero exit code if MQLLMEngine startup fails (vllm-project#8572)

[Bugfix] add `dead_error` property to engine client (vllm-project#8574)

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

[Kernel] Remove marlin moe templating on thread_m_blocks (vllm-project#8573)

Co-authored-by: lwilkinson@neuralmagic.com

[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models.  (vllm-project#8545)

Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (vllm-project#8593)

[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (vllm-project#8616)

[MISC] remove engine_use_ray in benchmark_throughput.py (vllm-project#8615)

[Frontend] Use MQLLMEngine for embeddings models too (vllm-project#8584)

[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (vllm-project#8577)

[Core] simplify logits resort in _apply_top_k_top_p (vllm-project#8619)

[Doc] Add documentation for GGUF quantization (vllm-project#8618)

Create SECURITY.md (vllm-project#8642)

[CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (vllm-project#8551)

[Misc] guard against change in cuda library name (vllm-project#8609)

[Bugfix] Fix Phi3.5 mini and MoE LoRA inference (vllm-project#8571)

[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (vllm-project#8474)

[Core] Support Lora lineage and base model metadata management (vllm-project#6315)

[Model] Add OLMoE (vllm-project#7922)

[CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (vllm-project#8670)

[Bugfix] Validate SamplingParam n is an int (vllm-project#8548)

[Misc] Show AMD GPU topology in `collect_env.py` (vllm-project#8649)

[Bugfix] Config got an unexpected keyword argument 'engine' (vllm-project#8556)

[Bugfix][Core] Fix tekken edge case for mistral tokenizer (vllm-project#8640)

[Doc] neuron documentation update (vllm-project#8671)

Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

[Hardware][AWS] update neuron to 2.20 (vllm-project#8676)

Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

[Bugfix] Fix incorrect llava next feature size calculation (vllm-project#8496)

[Core] Rename `PromptInputs` and `inputs` (vllm-project#8673)

[MISC] add support custom_op check (vllm-project#8557)

Co-authored-by: youkaichao <youkaichao@126.com>

[Core] Factor out common code in `SequenceData` and `Sequence` (vllm-project#8675)

[beam search] add output for manually checking the correctness (vllm-project#8684)

[Kernel] Build flash-attn from source (vllm-project#8245)

[VLM] Use `SequenceData.from_token_counts` to create dummy data (vllm-project#8687)

[Doc] Fix typo in AMD installation guide (vllm-project#8689)

[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (vllm-project#8646)

[dbrx] refactor dbrx experts to extend FusedMoe class (vllm-project#8518)

[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (vllm-project#8643)

[Bugfix] Refactor composite weight loading logic (vllm-project#8656)

[ci][build] fix vllm-flash-attn (vllm-project#8699)

[Model] Refactor BLIP/BLIP-2 to support composite model loading (vllm-project#8407)

[Misc] Use NamedTuple in Multi-image example (vllm-project#8705)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (vllm-project#8703)

[Model][VLM] Add LLaVA-Onevision model support (vllm-project#8486)

Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[SpecDec][Misc] Cleanup, remove bonus token logic. (vllm-project#8701)

[build] enable existing pytorch (for GH200, aarch64, nightly) (vllm-project#8713)

[misc] upgrade mistral-common (vllm-project#8715)

[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (vllm-project#8702)

[Bugfix] Fix CPU CMake build (vllm-project#8723)

Co-authored-by: Yuan <yuan.zhou@intel.com>

[Bugfix] fix docker build for xpu (vllm-project#8652)

[Core][Frontend] Support Passing Multimodal Processor Kwargs (vllm-project#8657)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

[Hardware][CPU] Refactor CPU model runner (vllm-project#8729)

[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (vllm-project#8733)

[Model] Support pp for qwen2-vl (vllm-project#8696)

[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (vllm-project#8707)

[CI/Build] use setuptools-scm to set __version__ (vllm-project#4738)

Co-authored-by: youkaichao <youkaichao@126.com>

[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (vllm-project#7701)

Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

[Kernel][LoRA]  Add assertion for punica sgmv kernels (vllm-project#7585)

[Core] Allow IPv6 in VLLM_HOST_IP with zmq (vllm-project#8575)

Signed-off-by: Russell Bryant <rbryant@redhat.com>

Fix typical acceptance sampler with correct recovered token ids (vllm-project#8562)

Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (vllm-project#8335)

[Hardware][AMD] ROCm6.2 upgrade (vllm-project#8674)

Fix tests in test_scheduler.py that fail with BlockManager V2 (vllm-project#8728)

re-implement beam search on top of vllm core (vllm-project#8726)

Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>

Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (vllm-project#8750)

[MISC] Skip dumping inputs when unpicklable (vllm-project#8744)

[Core][Model] Support loading weights by ID within models (vllm-project#7931)

[Model] Expose Phi3v num_crops as a mm_processor_kwarg (vllm-project#8658)

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

[Bugfix] Fix potentially unsafe custom allreduce synchronization (vllm-project#8558)

[Kernel] Split Marlin MoE kernels into multiple files (vllm-project#8661)

Co-authored-by: mgoin <michael@neuralmagic.com>

[Frontend] Batch inference for llm.chat() API  (vllm-project#8648)

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

[Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (vllm-project#8748)

[CI/Build] fix setuptools-scm usage (vllm-project#8771)

[misc] soft drop beam search (vllm-project#8763)

[Misc] Upgrade bitsandbytes to the latest version 0.44.0 (vllm-project#8768)

[Core][Bugfix] Support prompt_logprobs returned with speculative decoding (vllm-project#8047)

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>

[Core] Adding Priority Scheduling (vllm-project#5958)

[Bugfix] Use heartbeats instead of health checks (vllm-project#8583)

Fix test_schedule_swapped_simple in test_scheduler.py (vllm-project#8780)

[Bugfix][Kernel] Implement acquire/release polyfill for Pascal (vllm-project#8776)

Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (vllm-project#8752)

[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (vllm-project#8250)

[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (vllm-project#8770)

[Bugfix] load fc bias from config for eagle (vllm-project#8790)

[Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer (vllm-project#8672)

[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node (vllm-project#8767)

Signed-off-by: darthhexx <darthhexx@gmail.com>

[Misc] Fix minor typo in scheduler (vllm-project#8765)

[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade (vllm-project#8777)

[Kernel] Fullgraph and opcheck tests (vllm-project#8479)

[Misc] Add extra deps for openai server image (vllm-project#8792)

[VLM][Bugfix] internvl with num_scheduler_steps > 1 (vllm-project#8614)

rename PromptInputs and inputs with backward compatibility (vllm-project#8760)

[Frontend] MQLLMEngine supports profiling. (vllm-project#8761)

[Misc] Support FP8 MoE for compressed-tensors (vllm-project#8588)

Revert "rename PromptInputs and inputs with backward compatibility (vllm-project#8760)" (vllm-project#8810)

[Model] Add support for the multi-modal Llama 3.2 model (vllm-project#8811)

Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>

[Doc] Update doc for Transformers 4.45 (vllm-project#8817)

[Misc] Support quantization of MllamaForCausalLM (vllm-project#8822)

[Misc] Update config loading for Qwen2-VL and remove Granite (vllm-project#8837)

[Build/CI] Upgrade to gcc 10 in the base build Docker image (vllm-project#8814)

[Docs] Add README to the build docker image (vllm-project#8825)

[CI/Build] Fix missing ci dependencies (vllm-project#8834)

[misc][installation] build from source without compilation (vllm-project#8818)

[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (vllm-project#8872)

Signed-off-by: kevin <kevin@anyscale.com>

[Bugfix] Include encoder prompts len to non-stream api usage response (vllm-project#8861)

[Misc] Change dummy profiling and BOS fallback warns to log once (vllm-project#8820)

[Bugfix] Fix print_warning_once's line info (vllm-project#8867)

fix validation: Only set tool_choice `auto` if at least one tool is provided (vllm-project#8568)

[Bugfix] Fixup advance_step.cu warning (vllm-project#8815)

[BugFix] Fix test breakages from transformers 4.45 upgrade (vllm-project#8829)

[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (vllm-project#8764)

[Feature] Add support for Llama 3.1 and 3.2 tool use (vllm-project#8343)

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

[Core] rename `PromptInputs` and `inputs` (vllm-project#8876)

[misc] fix collect env (vllm-project#8894)

[MISC] Fix invalid escape sequence '\' (vllm-project#8830)

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

[Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (vllm-project#8892)

[TPU] Update pallas.py to support trillium (vllm-project#8871)

[torch.compile] use empty tensor instead of None for profiling (vllm-project#8875)

[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (vllm-project#7271)

[Bugfix] fix for deepseek w4a16 (vllm-project#8906)

Co-authored-by: mgoin <michael@neuralmagic.com>

[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (vllm-project#8378)

Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

[misc][distributed] add VLLM_SKIP_P2P_CHECK flag (vllm-project#8911)

[Core] Priority-based scheduling in async engine (vllm-project#8850)

[misc] fix wheel name (vllm-project#8919)

[Bugfix][Intel] Fix XPU Dockerfile Build (vllm-project#7824)

Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
Co-authored-by: youkaichao <youkaichao@126.com>

[Misc] Remove vLLM patch of `BaichuanTokenizer` (vllm-project#8921)

[Bugfix] Fix code for downloading models from modelscope (vllm-project#8443)

[Bugfix] Fix PP for Multi-Step (vllm-project#8887)

[CI/Build] Update models tests & examples (vllm-project#8874)

Co-authored-by: Roger Wang <ywang@roblox.com>

[Frontend] Make beam search emulator temperature modifiable (vllm-project#8928)

Co-authored-by: Eduard Balzin <nfunctor@yahoo.fr>

[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (vllm-project#8891)

[doc] organize installation doc and expose per-commit docker (vllm-project#8931)

[Core] Improve choice of Python multiprocessing method (vllm-project#8823)

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: youkaichao <youkaichao@126.com>

[Bugfix] Block manager v2 with preemption and lookahead slots (vllm-project#8824)

[Bugfix] Fix Marlin MoE act order when is_k_full == False (vllm-project#8741)

Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

[CI/Build] Add test decorator for minimum GPU memory (vllm-project#8925)

[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (vllm-project#8930)

[Model] Support Qwen2.5-Math-RM-72B (vllm-project#8896)

[Model][LoRA]LoRA support added for MiniCPMV2.5 (vllm-project#7199)

[BugFix] Fix seeded random sampling with encoder-decoder models (vllm-project#8870)

Co-authored-by: Roger Wang <ywang@roblox.com>

[Misc] Fix typo in BlockSpaceManagerV1 (vllm-project#8944)

[Frontend] Added support for HF's new `continue_final_message` parameter (vllm-project#8942)

[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (vllm-project#8533)
MengqingCao committed Oct 10, 2024
1 parent bfa7ee1 commit af569eb
Showing 541 changed files with 34,240 additions and 12,115 deletions.
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.764
- name: "exact_match,flexible-extract"
value: 0.764
limit: 250
num_fewshot: 5
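
A config like the one above is only a YAML record of expected lm-eval scores; the correctness test further down reads such a file and compares each metric against a fresh run. A minimal reading sketch in Python, assuming PyYAML and an illustrative filename:

import yaml  # PyYAML, assumed available

# Filename is illustrative; the actual config lives under .buildkite/lm-eval-harness/configs/.
with open("Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml") as f:
    eval_config = yaml.safe_load(f)

print(eval_config["model_name"])
for task in eval_config["tasks"]:
    for metric in task["metrics"]:
        print(f'{task["name"]} | {metric["name"]}: expected {metric["value"]}')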
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,7 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base-FP8.yaml
7 changes: 6 additions & 1 deletion .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -49,10 +49,15 @@ def test_lm_eval_correctness():
results = launch_lm_eval(eval_config)

# Confirm scores match ground truth.
success = True
for task in eval_config["tasks"]:
for metric in task["metrics"]:
ground_truth = metric["value"]
measured_value = results["results"][task["name"]][metric["name"]]
print(f'{task["name"]} | {metric["name"]}: '
f'ground_truth={ground_truth} | measured={measured_value}')
assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
success = success and numpy.isclose(
ground_truth, measured_value, rtol=RTOL)

# Assert at the end, print all scores even on failure for debugging.
assert success
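
The hunk above switches from asserting each metric immediately to collecting a single success flag, so every score is printed before the test fails. A minimal, self-contained sketch of that collect-then-assert pattern (values are illustrative, not real results):

import numpy

RTOL = 0.05  # illustrative tolerance
# (task, metric, ground_truth, measured) — made-up numbers for demonstration
results = [
    ("gsm8k", "exact_match,strict-match", 0.764, 0.758),
    ("gsm8k", "exact_match,flexible-extract", 0.764, 0.761),
]

success = True
for task_name, metric_name, ground_truth, measured in results:
    print(f"{task_name} | {metric_name}: "
          f"ground_truth={ground_truth} | measured={measured}")
    # Keep comparing even after a mismatch so every score gets printed.
    success = success and numpy.isclose(ground_truth, measured, rtol=RTOL)

assert success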
3 changes: 1 addition & 2 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -8,8 +8,7 @@ steps:
containers:
- image: badouralix/curl-jq
command:
- sh
- .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
- wait
- label: "A100"
agents:
4 changes: 3 additions & 1 deletion .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
@@ -2,9 +2,11 @@
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"

TIMEOUT_SECONDS=10

retries=0
while [ $retries -lt 1000 ]; do
if [ $(curl -s -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" $URL) -eq 200 ]; then
if [ $(curl -s --max-time $TIMEOUT_SECONDS -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" $URL) -eq 200 ]; then
exit 0
fi

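The --max-time flag added above bounds each curl invocation, so a stalled connection can no longer hang an attempt; only the per-request time is capped, and the surrounding loop still retries. The same polling-with-per-request-timeout idea as a rough Python sketch (names and intervals are illustrative):

import time
import urllib.error
import urllib.request

def wait_for_image(url, token, timeout_seconds=10, max_retries=1000, poll_interval=15.0):
    # Poll the manifest URL until it returns HTTP 200, capping every request
    # with timeout_seconds so one hung connection cannot stall the whole wait.
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    for _ in range(max_retries):
        try:
            with urllib.request.urlopen(req, timeout=timeout_seconds) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, TimeoutError):
            pass  # not published yet, or this attempt timed out
        time.sleep(poll_interval)
    return False
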
5 changes: 3 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -8,8 +8,9 @@ steps:
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1
- "for f in artifacts/dist/*.whl; do mv -- \"$$f\" \"$${f/linux/manylinux1}\"; done"
- "aws s3 cp --recursive artifacts/dist s3://vllm-wheels/$BUILDKITE_COMMIT/"
- "aws s3 cp --recursive artifacts/dist s3://vllm-wheels/nightly/"
- "mv artifacts/dist/$(ls artifacts/dist) artifacts/dist/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
- "aws s3 cp artifacts/dist/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl s3://vllm-wheels/$BUILDKITE_COMMIT/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
- "aws s3 cp artifacts/dist/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl s3://vllm-wheels/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
env:
DOCKER_BUILDKIT: "1"

36 changes: 35 additions & 1 deletion .buildkite/run-amd-test.sh
@@ -71,13 +71,47 @@ mkdir -p ${HF_CACHE}
HF_MOUNT="/root/.cache/huggingface"

commands=$@
echo "Commands:$commands"
#ignore certain kernels tests
if [[ $commands == *" kernels "* ]]; then
commands="${commands} \
--ignore=kernels/test_attention.py \
--ignore=kernels/test_attention_selector.py \
--ignore=kernels/test_blocksparse_attention.py \
--ignore=kernels/test_causal_conv1d.py \
--ignore=kernels/test_cutlass.py \
--ignore=kernels/test_encoder_decoder_attn.py \
--ignore=kernels/test_flash_attn.py \
--ignore=kernels/test_flashinfer.py \
--ignore=kernels/test_gguf.py \
--ignore=kernels/test_int8_quant.py \
--ignore=kernels/test_machete_gemm.py \
--ignore=kernels/test_mamba_ssm.py \
--ignore=kernels/test_marlin_gemm.py \
--ignore=kernels/test_moe.py \
--ignore=kernels/test_prefix_prefill.py \
--ignore=kernels/test_rand.py \
--ignore=kernels/test_sampler.py"
fi

#ignore certain Entrypoints tests
if [[ $commands == *" entrypoints/openai "* ]]; then
commands=${commands//" entrypoints/openai "/" entrypoints/openai \
--ignore=entrypoints/openai/test_accuracy.py \
--ignore=entrypoints/openai/test_audio.py \
--ignore=entrypoints/openai/test_encoder_decoder.py \
--ignore=entrypoints/openai/test_embedding.py \
--ignore=entrypoints/openai/test_oot_registration.py "}
fi

PARALLEL_JOB_COUNT=8
# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
#replace shard arguments
commands=${@//"--shard-id= "/"--shard-id=${GPU} "}
commands=${commands//"--shard-id= "/"--shard-id=${GPU} "}
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
echo "Shard ${GPU} commands:$commands"
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
3 changes: 2 additions & 1 deletion .buildkite/run-cpu-test-ppc64le.sh
@@ -11,8 +11,9 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
source /etc/environment
#docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN=$HF_TOKEN --name cpu-test cpu-test

# Run basic model test
docker exec cpu-test bash -c "
18 changes: 11 additions & 7 deletions .buildkite/run-cpu-test.sh
@@ -22,13 +22,17 @@ docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py"

# Run basic model test
docker exec cpu-test bash -c "
pip install pytest matplotlib einops transformers_stream_generator
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py \
--ignore=tests/models/test_oot_registration.py \
--ignore=tests/models/test_registry.py \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/test_jamba.py \
--ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported
pip install pytest matplotlib einops transformers_stream_generator datamodel_code_generator
pytest -v -s tests/models/decoder_only/language \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/decoder_only/language/test_jamba.py \
--ignore=tests/models/decoder_only/language/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# Run compressed-tensor test
docker exec cpu-test bash -c "
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynanmic_per_token"

# online inference
docker exec cpu-test bash -c "
2 changes: 1 addition & 1 deletion .buildkite/run-xpu-test.sh
@@ -11,4 +11,4 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path xpu-test python3 examples/offline_inference.py
docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test python3 examples/offline_inference.py