This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Upstream sync 2024 06 08 #288

Merged 101 commits on Jun 10, 2024.

Commits (101)
e69d23b
[Kernel] Add marlin_24 unit tests (#4901)
alexm-neuralmagic May 19, 2024
81ec16b
[Kernel] Add flash-attn back (#4907)
WoosukKwon May 20, 2024
5500975
[Model] LLaVA model refactor (#4910)
DarkLight1337 May 20, 2024
b913d04
Remove marlin warning (#4918)
alexm-neuralmagic May 20, 2024
683a30b
[Misc]: allow user to specify port in distributed setting (#4914)
ZwwWayne May 20, 2024
c8794c3
[Build/CI] Enabling AMD Entrypoints Test (#4834)
Alexei-V-Ivanov-AMD May 20, 2024
5b6a7b5
[Bugfix] Fix dummy weight for fp8 (#4916)
mzusman May 20, 2024
a5e66c7
[Core] Sharded State Loader download from HF (#4889)
aurickq May 20, 2024
8a78ed8
[Doc]Add documentation to benchmarking script when running TGI (#4920)
KuntaiDu May 20, 2024
6b46dcf
[Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897)
Yard1 May 21, 2024
907d48a
[Model] add rope_scaling support for qwen2 (#4930)
hzhwcmhf May 21, 2024
11d6f7e
[Model] Add Phi-2 LoRA support (#4886)
Isotr0py May 21, 2024
5d98989
[Docs] Add acknowledgment for sponsors (#4925)
simon-mo May 21, 2024
58a235b
[CI/Build] Codespell ignore `build/` directory (#4945)
mgoin May 21, 2024
253d8fb
[Bugfix] Fix flag name for `max_seq_len_to_capture` (#4935)
kerthcet May 21, 2024
f744125
[Bugfix][Kernel] Add head size check for attention backend selection …
Isotr0py May 21, 2024
c1672a9
[Frontend] Dynamic RoPE scaling (#4638)
sasha0552 May 22, 2024
4b6c961
[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#…
mgoin May 22, 2024
4b74974
[misc] remove comments that were supposed to be removed (#4977)
rkooo567 May 22, 2024
39c15ee
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
tlrmchlsmth May 22, 2024
2835fc6
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
comaniac May 22, 2024
3db99a6
[Model] LoRA gptbigcode implementation (#3949)
raywanb May 22, 2024
39a0a40
[Core] Eliminate parallel worker per-step task scheduling overhead (#…
njhill May 22, 2024
847ca88
[Minor] Fix small typo in llama.py: QKVParallelLinear -> Quantization…
pcmoritz May 22, 2024
c60384c
[Misc] Take user preference in attention selector (#4960)
comaniac May 22, 2024
dae5aaf
Marlin 24 prefill performance improvement (about 25% better on averag…
alexm-neuralmagic May 23, 2024
05a4f64
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is n…
LetianLee May 23, 2024
bf4c411
[Core][1/N] Support send/recv in PyNCCL Groups (#4988)
andoorve May 23, 2024
c623663
[Kernel] Initial Activation Quantization Support (#4525)
dsikka May 23, 2024
a9ca32d
[Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985)
kezouke May 23, 2024
0eb33b1
[Doc] add ccache guide in doc (#5012)
youkaichao May 23, 2024
acf362c
[Kernel] Initial Activation Quantization Support (#4525)
robertgshaw2-neuralmagic May 24, 2024
1226d5d
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
leiwen83 May 24, 2024
29a2098
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3…
linxihui May 25, 2024
3fe7e52
[Misc] add logging level env var (#5045)
youkaichao May 25, 2024
8768b3f
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding …
LiuXiaoxuanPKU May 25, 2024
e7e376f
[Misc] Make Serving Benchmark More User-friendly (#5044)
ywang96 May 25, 2024
67ce9ea
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
zhuohan123 May 27, 2024
2c59c91
[Core] Allow AQLM on Pascal (#5058)
sasha0552 May 27, 2024
9fb7b82
[Model] Add support for falcon-11B (#5069)
Isotr0py May 27, 2024
954c332
[Core] Sliding window for block manager v2 (#4545)
mmoskal May 28, 2024
9929fb2
[BugFix] Fix Embedding Models with TP>1 (#5075)
robertgshaw2-neuralmagic May 28, 2024
b22d985
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)
divakar-amd May 28, 2024
54c17a9
[Docs] Add Dropbox as sponsors (#5089)
simon-mo May 28, 2024
8c9aab4
[Core] Consolidate prompt arguments to LLM engines (#4328)
DarkLight1337 May 28, 2024
705789d
[Bugfix] Remove the last EOS token unless explicitly specified (#5077)
jsato8094 May 29, 2024
95c2a3d
[Misc] add gpu_memory_utilization arg (#5079)
pandyamarut May 29, 2024
9175890
[Core][Optimization] remove vllm-nccl (#5091)
youkaichao May 29, 2024
420c4ff
[Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092)
DarkLight1337 May 29, 2024
5bde5ba
[Core][Distributed] improve p2p access check (#4992)
youkaichao May 29, 2024
b86aa89
[Core] Cross-attention KV caching and memory-management (towards even…
afeldman-nm May 29, 2024
f63e8dd
[Doc]Replace deprecated flag in readme (#4526)
ronensc May 29, 2024
62a4fcb
[Bugfix][CI/Build] Fix test and improve code for `merge_async_iterato…
DarkLight1337 May 29, 2024
f900bcc
[Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` …
DarkLight1337 May 29, 2024
6824b2f
[Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099)
DarkLight1337 May 29, 2024
623275f
[Bugfix] logprobs is not compatible with the OpenAI spec #4795 (#5031)
Etelis May 29, 2024
15dcd3e
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
youkaichao May 29, 2024
5763c73
[Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#…
alexm-neuralmagic May 30, 2024
3a8332c
[CI/Build] Docker cleanup functionality for amd servers (#5112)
okakarpa May 30, 2024
11a5a26
[BUGFIX] [FRONTEND] Correct chat logprobs (#5029)
br3no May 30, 2024
2827c68
[Bugfix] Automatically Detect SparseML models (#5119)
robertgshaw2-neuralmagic May 30, 2024
4ae80dd
[CI/Build] increase wheel size limit to 200 MB (#5130)
youkaichao May 30, 2024
886ead6
[Misc] remove duplicate definition of `seq_lens_tensor` in model_runn…
ita9naiwa May 30, 2024
758b903
[Doc] Use intersphinx and update entrypoints docs (#5125)
DarkLight1337 May 30, 2024
a190463
add doc about serving option on dstack (#3074)
deep-diver May 30, 2024
51cf757
Bump version to v0.4.3 (#5046)
simon-mo May 30, 2024
c72d890
[Build] Disable sm_90a in cu11 (#5141)
simon-mo May 30, 2024
cf0711b
[Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120)
robertgshaw2-neuralmagic May 31, 2024
dcaf819
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::orde…
alexm-neuralmagic May 31, 2024
7da3c3f
Fix cutlass sm_90a version in CMakeList
simon-mo May 31, 2024
2c66f17
[Model] Support MAP-NEO model (#5081)
xingweiqu May 31, 2024
5388c64
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using th…
simon-mo May 31, 2024
5e9f300
[Misc]: optimize eager mode host time (#4196)
FuncSherl May 31, 2024
f329e2e
[Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039)
comaniac May 31, 2024
951e3d2
[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (#5171)
njhill Jun 1, 2024
d349dbd
[Build] Guard against older CUDA versions when building CUTLASS 3.x k…
tlrmchlsmth Jun 1, 2024
031fd4e
format
robertgshaw2-neuralmagic Jun 8, 2024
9ed5f76
skip blockspase attention
robertgshaw2-neuralmagic Jun 9, 2024
ec71544
fix falcon
robertgshaw2-neuralmagic Jun 9, 2024
7381340
skip sliding window chunked prefill
robertgshaw2-neuralmagic Jun 9, 2024
c23ca05
skip prefix prefill
robertgshaw2-neuralmagic Jun 9, 2024
85512eb
skip tensorizer
robertgshaw2-neuralmagic Jun 9, 2024
0cea2c2
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input…
mgoin Jun 8, 2024
31147df
format
robertgshaw2-neuralmagic Jun 9, 2024
2256610
fix issue with internal method
robertgshaw2-neuralmagic Jun 9, 2024
01973f5
formatting
robertgshaw2-neuralmagic Jun 9, 2024
a1a659d
disabled more kernel tests that use triton
robertgshaw2-neuralmagic Jun 9, 2024
c50784c
updated cutlass skipping. We need cuda 12.4 in automation
robertgshaw2-neuralmagic Jun 9, 2024
99fa9f8
trigger kernel tests in automation
robertgshaw2-neuralmagic Jun 9, 2024
2ec6643
cleanup spurious setup.py change
robertgshaw2-neuralmagic Jun 9, 2024
0bb099c
readded the missing images
robertgshaw2-neuralmagic Jun 9, 2024
198f364
multilora inference
robertgshaw2-neuralmagic Jun 9, 2024
ec0e89a
offline inference with prefix
robertgshaw2-neuralmagic Jun 9, 2024
e6f1cbd
backend request func
robertgshaw2-neuralmagic Jun 9, 2024
ca8d74a
benchmark serving
robertgshaw2-neuralmagic Jun 9, 2024
5335ad9
prod monitoring readme
robertgshaw2-neuralmagic Jun 9, 2024
611cfed
format
robertgshaw2-neuralmagic Jun 9, 2024
73132a5
fix benchmark issue - internal method changed
robertgshaw2-neuralmagic Jun 9, 2024
7f5c715
removed skip for remote push edits
robertgshaw2-neuralmagic Jun 9, 2024
437912e
update internal method in benchmark throughput too
robertgshaw2-neuralmagic Jun 10, 2024
950981c
skip triton sampler tests
robertgshaw2-neuralmagic Jun 10, 2024
2 changes: 1 addition & 1 deletion .buildkite/check-wheel-size.py
@@ -1,7 +1,7 @@
import os
import zipfile

MAX_SIZE_MB = 150
MAX_SIZE_MB = 200


def print_top_10_largest_files(zip_file):
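The limit raised above (#5130) is enforced by this script, which inspects the built wheel as a zip archive. A rough sketch of that kind of size gate, assuming the 200 MB cap set here (the rest of the real script is truncated in this diff, so the details beyond the shown lines are illustrative):

```python
# Illustrative wheel-size gate; the real check-wheel-size.py may differ beyond
# the lines shown in the diff above.
import os
import sys
import zipfile

MAX_SIZE_MB = 200


def print_top_10_largest_files(zip_file: str) -> None:
    with zipfile.ZipFile(zip_file) as zf:
        entries = sorted(zf.infolist(), key=lambda i: i.file_size, reverse=True)
        for info in entries[:10]:
            print(f"{info.file_size / (1024 * 1024):8.2f} MB  {info.filename}")


if __name__ == "__main__":
    wheel = sys.argv[1]
    size_mb = os.path.getsize(wheel) / (1024 * 1024)
    if size_mb > MAX_SIZE_MB:
        print_top_10_largest_files(wheel)
        raise SystemExit(
            f"{wheel} is {size_mb:.1f} MB, over the {MAX_SIZE_MB} MB limit")
    print(f"{wheel} is {size_mb:.1f} MB, within the {MAX_SIZE_MB} MB limit")
```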
28 changes: 28 additions & 0 deletions .buildkite/run-amd-test.sh
@@ -5,6 +5,34 @@ set -ex
echo "--- ROCm info"
rocminfo

# cleanup older docker images
cleanup_docker() {
# Get Docker's root directory
docker_root=$(docker info -f '{{.DockerRootDir}}')
if [ -z "$docker_root" ]; then
echo "Failed to determine Docker root directory."
exit 1
fi
echo "Docker root directory: $docker_root"
# Check disk usage of the filesystem where Docker's root directory is located
disk_usage=$(df "$docker_root" | tail -1 | awk '{print $5}' | sed 's/%//')
# Define the threshold
threshold=70
if [ "$disk_usage" -gt "$threshold" ]; then
echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..."
# Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f
# Remove unused volumes
docker volume prune -f
echo "Docker images and volumes cleanup completed."
else
echo "Disk usage is below $threshold%. No cleanup needed."
fi
}

# Call the cleanup docker function
cleanup_docker

echo "--- Resetting GPUs"

echo "reset" > /opt/amdgpu/etc/gpu_state
2 changes: 1 addition & 1 deletion .buildkite/run-cpu-test.sh
@@ -11,4 +11,4 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_CPU_KVCACHE_SPACE=1 --name cpu-test cpu-test python3 examples/offline_inference.py
docker run --network host --env VLLM_CPU_KVCACHE_SPACE=1 --name cpu-test cpu-test python3 vllm/examples/offline_inference.py
13 changes: 8 additions & 5 deletions .buildkite/test-pipeline.yaml
@@ -37,7 +37,6 @@ steps:
working_dir: "/vllm-workspace/tests"
num_gpus: 2
commands:
- pytest -v -s distributed/test_pynccl_library.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_chunked_prefill_distributed.py
@@ -60,11 +59,12 @@ steps:
command: pytest -v -s engine tokenization test_sequence.py test_config.py test_logger.py

- label: Entrypoints Test
#mirror_hardwares: [amd]
mirror_hardwares: [amd]

commands:
# these tests have to be separated, because each one will allocate all posible GPU memory
- pytest -v -s entrypoints --ignore=entrypoints/test_server_oot_registration.py
- pytest -v -s entrypoints/test_server_oot_registration.py
- pytest -v -s test_inputs.py
- pytest -v -s entrypoints -m llm
- pytest -v -s entrypoints -m openai

- label: Examples Test
working_dir: "/vllm-workspace/examples"
@@ -109,6 +109,9 @@ steps:
mirror_hardwares: [amd]
command: pytest -v -s test_logits_processor.py

- label: Utils Test
command: pytest -v -s test_utils.py

- label: Worker Test
mirror_hardwares: [amd]
command: pytest -v -s worker
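The Entrypoints Test block above now selects tests with pytest markers (`-m llm`, `-m openai`) instead of `--ignore` paths. A minimal sketch of tests opting into those markers (marker names come from the pipeline; test bodies and marker registration are illustrative, and registration would normally live in the project's pytest configuration):

```python
# Illustrative tests tagged with the markers the pipeline selects on.
import pytest


@pytest.mark.llm
def test_llm_entrypoint_smoke():
    # Picked up by: pytest -v -s entrypoints -m llm
    assert True


@pytest.mark.openai
def test_openai_entrypoint_smoke():
    # Picked up by: pytest -v -s entrypoints -m openai
    assert True
```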
26 changes: 26 additions & 0 deletions .clang-format
@@ -0,0 +1,26 @@
BasedOnStyle: Google
UseTab: Never
IndentWidth: 2
ColumnLimit: 80

# Force pointers to the type for C++.
DerivePointerAlignment: false
PointerAlignment: Left

# Reordering #include statements can (and currently will) introduce errors
SortIncludes: false

# Style choices
AlignConsecutiveAssignments: false
AlignConsecutiveDeclarations: false
IndentPPDirectives: BeforeHash

IncludeCategories:
- Regex: '^<'
Priority: 4
- Regex: '^"(llvm|llvm-c|clang|clang-c|mlir|mlir-c)/'
Priority: 3
- Regex: '^"(qoda|\.\.)/'
Priority: 2
- Regex: '.*'
Priority: 1
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/400-bug report.yml
@@ -59,6 +59,8 @@ body:

Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.

Please set the environment variable `export VLLM_LOGGING_LEVEL=DEBUG` to turn on more logging to help debugging potential issues.

If you experienced crashes or hangs, it would be helpful to run vllm with `export VLLM_TRACE_FUNCTION=1` . All the function calls in vllm will be recorded. Inspect these log files, and tell which function crashes or hangs.
placeholder: |
A clear and concise description of what the bug is.
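The two debugging aids mentioned in the template above are plain environment variables, so they can also be set from Python before vLLM is imported. A quick sketch, equivalent to the `export` lines in the template (model name is illustrative):

```python
# Enable the debug aids described in the bug-report template; they must be set
# before vllm is imported so the logger and function tracing pick them up.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # more verbose logging
os.environ["VLLM_TRACE_FUNCTION"] = "1"     # record function calls for crash/hang triage

from vllm import LLM, SamplingParams  # noqa: E402

llm = LLM(model="facebook/opt-125m")
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```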
42 changes: 42 additions & 0 deletions .github/workflows/clang-format.yml
@@ -0,0 +1,42 @@
name: clang-format

on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
pull_request:
branches:
- main

jobs:
clang-format:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install clang-format==18.1.5
- name: Running clang-format
run: |
EXCLUDES=(
'csrc/moe/topk_softmax_kernels.cu'
'csrc/punica/bgmv/bgmv_bf16_bf16_bf16.cu'
'csrc/punica/bgmv/bgmv_config.h'
'csrc/punica/bgmv/bgmv_impl.cuh'
'csrc/punica/bgmv/vec_dtypes.cuh'
'csrc/punica/punica_ops.cu'
'csrc/punica/type_convert.h'
)
find csrc/ \( -name '*.h' -o -name '*.cpp' -o -name '*.cu' -o -name '*.cuh' \) -print \
| grep -vFf <(printf "%s\n" "${EXCLUDES[@]}") \
| xargs clang-format --dry-run --Werror
15 changes: 9 additions & 6 deletions CMakeLists.txt
@@ -167,6 +167,7 @@ set(VLLM_EXT_SRC
"csrc/layernorm_kernels.cu"
"csrc/quantization/squeezellm/quant_cuda_kernel.cu"
"csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
"csrc/quantization/fp8/common.cu"
"csrc/cuda_utils_kernels.cu"
"csrc/moe_align_block_size_kernels.cu"
Expand All @@ -176,7 +177,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
include(FetchContent)
SET(CUTLASS_ENABLE_HEADERS_ONLY=ON)
FetchContent_Declare(
cutlass
cutlass
GIT_REPOSITORY https://github.com/nvidia/cutlass.git
# CUTLASS 3.5.0
GIT_TAG 7d49e6c7e2f8896c47f586706e67e1fb215529dc
Expand All @@ -199,11 +200,13 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# The CUTLASS kernels for Hopper require sm90a to be enabled.
# This is done via the below gencode option, BUT that creates kernels for both sm90 and sm90a.
# That adds an extra 17MB to compiled binary, so instead we selectively enable it.
set_source_files_properties(
"csrc/quantization/cutlass_w8a8/scaled_mm_dq_c3x.cu"
PROPERTIES
COMPILE_FLAGS
"-gencode arch=compute_90a,code=sm_90a")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0)
set_source_files_properties(
"csrc/quantization/cutlass_w8a8/scaled_mm_dq_c3x.cu"
PROPERTIES
COMPILE_FLAGS
"-gencode arch=compute_90a,code=sm_90a")
endif()

endif()

2 changes: 2 additions & 0 deletions Dockerfile.cpu
@@ -17,4 +17,6 @@ RUN pip install -v -r requirements-cpu.txt --extra-index-url https://download.py

RUN VLLM_TARGET_DEVICE=cpu python3 setup.py install

WORKDIR /workspace/

CMD ["/bin/bash"]
8 changes: 6 additions & 2 deletions Dockerfile.rocm
@@ -92,19 +92,23 @@ RUN if [ "$BUILD_TRITON" = "1" ]; then \
WORKDIR /vllm-workspace
COPY . .

#RUN python3 -m pip install pynvml # to be removed eventually
RUN python3 -m pip install --upgrade pip numba

# make sure punica kernels are built (for LoRA)
ENV VLLM_INSTALL_PUNICA_KERNELS=1
# Workaround for ray >= 2.10.0
ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1

ENV VLLM_NCCL_SO_PATH=/opt/rocm/lib/librccl.so

RUN --mount=type=cache,target=/root/.cache/pip \
pip install -U -r requirements-rocm.txt \
&& patch /opt/rocm/include/hip/amd_detail/amd_hip_bf16.h ./rocm_patch/rocm_bf16.patch \
&& python3 setup.py install \
&& cp build/lib.linux-x86_64-cpython-39/vllm/_C.cpython-39-x86_64-linux-gnu.so vllm/ \
&& cp build/lib.linux-x86_64-cpython-39/vllm/_punica_C.cpython-39-x86_64-linux-gnu.so vllm/ \
&& cd ..

RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir ray[all]==2.9.3

CMD ["/bin/bash"]
10 changes: 6 additions & 4 deletions benchmarks/backend_request_func.py
@@ -1,7 +1,3 @@
# flake8: noqa
# UPSTREAM SYNC: noqa is required for passing ruff run on nm-automation
# This file has been modified by Neural Magic

import json
import os
import sys
@@ -93,6 +89,9 @@ async def async_request_tgi(
output.latency = most_recent_timestamp - st
output.success = True
output.generated_text = data["generated_text"]
else:
output.error = response.reason or ""
output.success = False
except Exception:
output.success = False
exc_info = sys.exc_info()
@@ -280,6 +279,9 @@ async def async_request_openai_completions(
output.generated_text = generated_text
output.success = True
output.latency = latency
else:
output.error = response.reason or ""
output.success = False
except Exception:
output.success = False
exc_info = sys.exc_info()
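The branches added above make the benchmark report an HTTP failure (`response.reason`) instead of leaving the output unset. A self-contained sketch of the same pattern with `aiohttp` (the endpoint, payload, and `RequestFuncOutput` stand-in are illustrative, not the benchmark's real types):

```python
# Illustrative version of the non-2xx handling added in the diff above.
import asyncio
from dataclasses import dataclass

import aiohttp


@dataclass
class RequestFuncOutput:  # stand-in for the benchmark's output record
    generated_text: str = ""
    success: bool = False
    error: str = ""


async def send_request(url: str, payload: dict) -> RequestFuncOutput:
    output = RequestFuncOutput()
    try:
        async with aiohttp.ClientSession() as session:
            async with session.post(url, json=payload) as response:
                if response.status == 200:
                    data = await response.json()
                    output.generated_text = data.get("generated_text", "")
                    output.success = True
                else:
                    # Surface the HTTP reason rather than failing silently.
                    output.error = response.reason or ""
                    output.success = False
    except Exception as exc:
        output.success = False
        output.error = str(exc)
    return output


if __name__ == "__main__":
    # Hypothetical TGI-style endpoint; adjust URL/payload for the real backend.
    print(asyncio.run(
        send_request("http://localhost:8080/generate", {"inputs": "Hello"})))
```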
34 changes: 21 additions & 13 deletions benchmarks/benchmark_latency.py
@@ -3,13 +3,14 @@
import json
import time
from pathlib import Path
from typing import Optional
from typing import List, Optional

import numpy as np
import torch
from tqdm import tqdm

from vllm import LLM, SamplingParams
from vllm.inputs import PromptStrictInputs
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS


@@ -34,7 +35,8 @@ def main(args: argparse.Namespace):
use_v2_block_manager=args.use_v2_block_manager,
enable_chunked_prefill=args.enable_chunked_prefill,
download_dir=args.download_dir,
block_size=args.block_size)
block_size=args.block_size,
gpu_memory_utilization=args.gpu_memory_utilization)

sampling_params = SamplingParams(
n=args.n,
@@ -48,7 +50,9 @@ def main(args: argparse.Namespace):
dummy_prompt_token_ids = np.random.randint(10000,
size=(args.batch_size,
args.input_len))
dummy_prompt_token_ids = dummy_prompt_token_ids.tolist()
dummy_inputs: List[PromptStrictInputs] = [{
"prompt_token_ids": batch
} for batch in dummy_prompt_token_ids.tolist()]

def run_to_completion(profile_dir: Optional[str] = None):
if profile_dir:
@@ -59,13 +63,13 @@ def run_to_completion(profile_dir: Optional[str] = None):
],
on_trace_ready=torch.profiler.tensorboard_trace_handler(
str(profile_dir))) as p:
llm.generate(prompt_token_ids=dummy_prompt_token_ids,
llm.generate(dummy_inputs,
sampling_params=sampling_params,
use_tqdm=False)
print(p.key_averages())
else:
start_time = time.perf_counter()
llm.generate(prompt_token_ids=dummy_prompt_token_ids,
llm.generate(dummy_inputs,
sampling_params=sampling_params,
use_tqdm=False)
end_time = time.perf_counter()
@@ -153,15 +157,13 @@ def run_to_completion(profile_dir: Optional[str] = None):
action='store_true',
help='enforce eager mode and disable CUDA graph')
parser.add_argument(
"--kv-cache-dtype",
'--kv-cache-dtype',
type=str,
choices=['auto', 'fp8'],
default='auto',
help=
'Data type for kv cache storage. If "auto", will use model data type. '
'FP8_E5M2 (without scaling) is only supported on cuda version greater '
'than 11.8. On ROCm (AMD GPU), FP8_E4M3 is '
'instead supported for common inference criteria.')
choices=['auto', 'fp8', 'fp8_e5m2', 'fp8_e4m3'],
default="auto",
help='Data type for kv cache storage. If "auto", will use model '
'data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. '
'ROCm (AMD GPU) supports fp8 (=fp8_e4m3)')
parser.add_argument(
'--quantization-param-path',
type=str,
@@ -213,5 +215,11 @@ def run_to_completion(profile_dir: Optional[str] = None):
type=str,
default=None,
help='Path to save the latency results in JSON format.')
parser.add_argument('--gpu-memory-utilization',
type=float,
default=0.9,
help='the fraction of GPU memory to be used for '
'the model executor, which can range from 0 to 1.'
'If unspecified, will use the default value of 0.9.')
args = parser.parse_args()
main(args)
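The changes above move benchmark_latency.py to the consolidated prompt-inputs API (#4328): token ids are passed as a list of `{"prompt_token_ids": ...}` dicts rather than the old `prompt_token_ids=` keyword, and GPU memory use is capped via the new `--gpu-memory-utilization` flag. A minimal sketch of the same call pattern outside the benchmark (model name and token ids are illustrative):

```python
# Minimal use of the new prompt-inputs style shown in the diff above.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)

dummy_inputs = [
    {"prompt_token_ids": [1, 2, 3, 4]},
    {"prompt_token_ids": [5, 6, 7, 8]},
]

outputs = llm.generate(dummy_inputs,
                       sampling_params=SamplingParams(max_tokens=16),
                       use_tqdm=False)
for out in outputs:
    print(out.outputs[0].text)
```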