
Batched benchmark script and more detailed benchmark metrics #25


Closed
zhuohan123 wants to merge 1 commit from the benchmark-script branch

Conversation

zhuohan123
Member

No description provided.

@zhuohan123 zhuohan123 requested a review from WoosukKwon April 3, 2023 14:12
@zhuohan123
Member Author

Closing this PR since it's too outdated.

@zhuohan123 zhuohan123 closed this Jun 3, 2023
@zhuohan123 zhuohan123 deleted the benchmark-script branch June 18, 2023 07:22
@zhuohan123 zhuohan123 restored the benchmark-script branch June 18, 2023 07:22
@zhuohan123 zhuohan123 deleted the benchmark-script branch June 18, 2023 07:22
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Apr 17, 2024
Disable weight compression on optimum-intel conversion path
z103cb referenced this pull request in z103cb/opendatahub_vllm May 9, 2024
Cherry-pick of fix commit 6100f4b from ODH:
opendatahub-io#17

---------

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Daniele Trifirò <dtrifiro@redhat.com>
dtrifiro added a commit to dtrifiro/vllm that referenced this pull request May 21, 2024
fxmarty pushed a commit to fxmarty/vllm-public that referenced this pull request May 31, 2024
Updates to custom PagedAttention for supporting context lengths up to 32k.
tianyil1 pushed a commit to tianyil1/vllm that referenced this pull request Jun 5, 2024
* Fix setup.py for HPU

* Fix  vllm._C import ops -> vllm.hpu import ops

* more of the same thing

* re-add hpex rmsnorm and rope; but rope is crashing

* remove unnecessary comments

* add vllm/hpu files

* add hpu autodetection

* Add HabanaAttention stub

* revert accidental changes

* revert non-habana backend attention changes

* add habana attention/worker/executor, sampling fails now

* Restore unnecessarily changed files

* enable HabanaMemoryProfiler

* Make sampler pass

* restore habana fused rope

* prefill is now working!!!

* fix prefill padding; decode is now working!!!!!

* revert accidental changes

* remove unused stuff in habana_paged_attn.py

* remove diagnostic stuff from llm_engine.py

* use HabanaExecutorAsync in async_llm_engine.py

* add habana copyright headers to habana_*.py files

* fix prefill attention conformance

* minor naming fixes

* remove naive attention from habana_attn (it never worked anyway)

* re-enable profile run

* Add fake HPUGraph support

* add more metrics

* indentation fix

* ~~recipe cache metrics don't work lalalala~~

* i'm done with metrics for now

* fix corner case in which hl-smi is not available but synapse is

* FIXME: temporary setup.py workaround

* WIP: add tensor parallelism stubs

* habana worker cleanup

* tensor parallelism is now working

* remove unused files

* remove unused func

* add hpugraphrunner

* improve hpu layernorm

* Port pipelined PA

* Port context length bucketing

* remove cudagraphrunner from hpu runner

* restore HPUGraphRunner back from FakeHPUGraphRunner

* handle rotary embeddings properly on gaudi3

* oopsie! captured_block_counts was incorrect!

* captured_block_counts.append doesn't do anything

* Restore habana_main KV cache memory layout

* fix memory profiler

* overhaul hpugraph capture

* memory profiling overhaul

* format memory properly in model warmup

* add graph compilation profiler for graph capture phase

* roll back log level on graph capture message

* Remove unnecessary view on residual connection in RMSNorm (vllm-project#25)

---------

Co-authored-by: madamczykhabana <110973826+madamczykhabana@users.noreply.github.com>
bigPYJ1151 pushed a commit to bigPYJ1151/vllm that referenced this pull request Jun 25, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
njhill pushed a commit to njhill/vllm that referenced this pull request Nov 6, 2024
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025
### What this PR does / why we need it?
Fix the device of tensors created in `AscendAttentionBackendImpl`.

When a device other than card-0 is specified, a **device conflict** occurs because
tensors such as `attn_mask` are placed on card-0 by default.

This PR creates these tensors on the card corresponding to the input (see the sketch
below).
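
The pattern, roughly: allocate auxiliary tensors on the same device as the incoming
input tensors instead of relying on the framework default. A minimal sketch of that
pattern, using a hypothetical `build_attn_mask` helper (not the actual vllm-ascend code):

```python
import torch

def build_attn_mask(query: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Illustrative only: allocate a causal mask on the input's device."""
    # Using query.device keeps the mask on the card chosen by the caller
    # (e.g. device="npu:1") instead of falling back to card-0.
    return torch.triu(
        torch.full((seq_len, seq_len), float("-inf"),
                   dtype=query.dtype, device=query.device),
        diagonal=1,
    )
```

The actual change applies this device handling to the tensors built inside
`AscendAttentionBackendImpl`; the sketch only shows the device-propagation idea.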

### Does this PR introduce _any_ user-facing change?
With this PR, users can specify the device by local rank. A corresponding change in
vLLM is also needed and will be linked to this PR once it is created.

### How was this patch tested?
Tested locally with the code below. A test case will be added once the corresponding
change in vLLM is completed.
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="~/.cache/modelscope/hub/Qwen/Qwen2___5-7B-Instruct", device="npu:1")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Signed-off-by: MengqingCao <cmq0113@163.com>