Conversation

@youkaichao

If we just add @torch.compile to the forward method, PyTorch will try to trace `self`, and every time `self` changes it re-compiles.

However, `self` does not actually matter for this computation; all we need is a standalone layernorm function. By extracting the computation into a standalone function, torch compiles it only once.
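
For reference, a minimal sketch of the pattern (illustrative only, not the exact diff in this PR; the function body and names are assumptions based on a standard layernorm):

```python
import torch
import torch.nn as nn

# Compiled once: a free function of tensors only, so torch.compile
# never sees a module instance and does not guard on `self`.
@torch.compile
def layer_norm_func(hidden_states: torch.Tensor,
                    weight: torch.Tensor,
                    variance_epsilon: float) -> torch.Tensor:
    input_dtype = hidden_states.dtype
    hidden_states = hidden_states.to(torch.float32)
    mean = hidden_states.mean(-1, keepdim=True)
    variance = (hidden_states - mean).pow(2).mean(-1, keepdim=True)
    hidden_states = (hidden_states - mean) * torch.rsqrt(variance + variance_epsilon)
    return (weight.to(torch.float32) * hidden_states).to(input_dtype)


class LayerNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-5) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    # Decorating this method directly with @torch.compile would put
    # `self` among the traced inputs and trigger re-compilation.
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return layer_norm_func(hidden_states, self.weight, self.variance_epsilon)
```

With this structure, every LayerNorm instance reuses the same compiled function instead of compiling per instance.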

benchmark script:

python benchmark_latency.py --model CohereForAI/c4ai-command-r-v01 -tp=4 --batch-size=1

without compile:
Avg latency: 2.43594957968841 seconds
10% percentile latency: 2.42827770607546 seconds
25% percentile latency: 2.429899246431887 seconds
50% percentile latency: 2.433337311260402 seconds
75% percentile latency: 2.436360448366031 seconds
90% percentile latency: 2.4394099053926768 seconds

with compile on forward:
Avg latency: 2.434882906700174 seconds
10% percentile latency: 2.4297383648343382 seconds
25% percentile latency: 2.4304619752801955 seconds
50% percentile latency: 2.433090591803193 seconds
75% percentile latency: 2.436410292983055 seconds
90% percentile latency: 2.441519308835268 seconds

with compile on standalone function:
Avg latency: 2.3143214230115214 seconds
10% percentile latency: 2.3101052045822144 seconds
25% percentile latency: 2.312119716545567 seconds
50% percentile latency: 2.3148383810184896 seconds
75% percentile latency: 2.316791471093893 seconds
90% percentile latency: 2.3175419165752826 seconds

More results:

Latency benchmark with the default batch size (8):

python benchmark_latency.py --model CohereForAI/c4ai-command-r-v01 -tp=4

without this PR:
Avg latency: 2.900428555874775 seconds
10% percentile latency: 2.8955503132194282 seconds
25% percentile latency: 2.897518435958773 seconds
50% percentile latency: 2.9004721106030047 seconds
75% percentile latency: 2.9027304998598993 seconds
90% percentile latency: 2.9058421747758985 seconds

with this PR:
Avg latency: 2.73428927740703 seconds
10% percentile latency: 2.7297521037049592 seconds
25% percentile latency: 2.7316771240439266 seconds
50% percentile latency: 2.732553725130856 seconds
75% percentile latency: 2.734609372681007 seconds
90% percentile latency: 2.744487049616873 seconds

Throughput benchmark with default arguments:

python benchmark_throughput.py --model CohereForAI/c4ai-command-r-v01 --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp=4

Without this PR:
Throughput: 7.52 requests/s, 3290.22 tokens/s

With this PR:
Throughput: 7.91 requests/s, 3463.78 tokens/s

Summary

~5% latency and throughput improvement.

@youkaichao youkaichao enabled auto-merge (squash) April 11, 2024 01:14
@youkaichao youkaichao merged commit caada5e into vllm-project:main Apr 11, 2024
@youkaichao youkaichao deleted the commandr_compile branch April 11, 2024 01:49
SageMoore pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 11, 2024:
[Core][Model] Use torch.compile to accelerate layernorm in commandr (vllm-project#3985)

andy-neuma pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 12, 2024:
[Core][Model] Use torch.compile to accelerate layernorm in commandr (vllm-project#3985)

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024:
[Core][Model] Use torch.compile to accelerate layernorm in commandr (vllm-project#3985)