[Core][Model] Use torch.compile to accelerate layernorm in commandr #3985
If we just add `@torch.compile` to the `forward` function, PyTorch will try to trace `self`, and every time `self` changes, it tries to re-compile. However, `self` actually does not matter in our computation. We just need a standalone layernorm function. So we can extract the computation into a standalone function, so that torch only compiles once.
benchmark script:
without compile:
Avg latency: 2.43594957968841 seconds
10% percentile latency: 2.42827770607546 seconds
25% percentile latency: 2.429899246431887 seconds
50% percentile latency: 2.433337311260402 seconds
75% percentile latency: 2.436360448366031 seconds
90% percentile latency: 2.4394099053926768 seconds
with compile on forward:
Avg latency: 2.434882906700174 seconds
10% percentile latency: 2.4297383648343382 seconds
25% percentile latency: 2.4304619752801955 seconds
50% percentile latency: 2.433090591803193 seconds
75% percentile latency: 2.436410292983055 seconds
90% percentile latency: 2.441519308835268 seconds
with compile on standalone function:
Avg latency: 2.3143214230115214 seconds
10% percentile latency: 2.3101052045822144 seconds
25% percentile latency: 2.312119716545567 seconds
50% percentile latency: 2.3148383810184896 seconds
75% percentile latency: 2.316791471093893 seconds
90% percentile latency: 2.3175419165752826 seconds
More results:
latency benchmark with default batchsize=8
without this PR:
Avg latency: 2.900428555874775 seconds
10% percentile latency: 2.8955503132194282 seconds
25% percentile latency: 2.897518435958773 seconds
50% percentile latency: 2.9004721106030047 seconds
75% percentile latency: 2.9027304998598993 seconds
90% percentile latency: 2.9058421747758985 seconds
with this PR:
Avg latency: 2.73428927740703 seconds
10% percentile latency: 2.7297521037049592 seconds
25% percentile latency: 2.7316771240439266 seconds
50% percentile latency: 2.732553725130856 seconds
75% percentile latency: 2.734609372681007 seconds
90% percentile latency: 2.744487049616873 seconds
throughput benchmark with default args
Without this PR:
Throughput: 7.52 requests/s, 3290.22 tokens/s
With this PR:
Throughput: 7.91 requests/s, 3463.78 tokens/s
Summary
~5% latency & throughput improvement