
Remove redundant code in ColumnParallelLinear #146

Merged · 2 commits · Jun 11, 2023

Conversation

WoosukKwon (Collaborator) commented Jun 11, 2023

This PR deletes redundant code in ColumnParallelLinear to reduce CPU overhead.

Before:

$ python benchmarks/benchmark_latency.py --model facebook/opt-6.7b
Namespace(model='facebook/opt-6.7b', tensor_parallel_size=1, input_len=32, output_len=128, batch_size=8, n=1, use_beam_search=False, num_iters=3, profile=False)
INFO 06-11 02:01:10 llm_server.py:60] Initializing an LLM server with config: model='facebook/opt-6.7b', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-11 02:01:22 llm_server.py:129] # GPU blocks: 2946, # CPU blocks: 512
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, stop=[], ignore_eos=True, max_tokens=128, logprobs=None)
Warming up...
Profiling iterations: 100%|██████████| 3/3 [00:08<00:00,  2.68s/it]
Avg latency: 2.6759115854899087 seconds

After:

$ python benchmarks/benchmark_latency.py --model facebook/opt-6.7b
Namespace(model='facebook/opt-6.7b', tensor_parallel_size=1, input_len=32, output_len=128, batch_size=8, n=1, use_beam_search=False, num_iters=3, profile=False)
INFO 06-11 02:02:10 llm_server.py:60] Initializing an LLM server with config: model='facebook/opt-6.7b', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-11 02:02:21 llm_server.py:129] # GPU blocks: 2946, # CPU blocks: 512
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, stop=[], ignore_eos=True, max_tokens=128, logprobs=None)
Warming up...
Profiling iterations: 100%|██████████| 3/3 [00:07<00:00,  2.36s/it]
Avg latency: 2.363809108734131 seconds
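For reference, the two average latencies reported above work out to roughly an 11.7% reduction. A quick check, using only the numbers copied from the logs:

```python
# Numbers copied verbatim from the benchmark logs above.
before_s = 2.6759115854899087  # avg latency before the change
after_s = 2.363809108734131    # avg latency after the change

# Relative latency reduction, in percent.
reduction_pct = (before_s - after_s) / before_s * 100
print(f"Latency reduction: {reduction_pct:.1f}%")  # prints "Latency reduction: 11.7%"
```

Both runs use the same settings (batch_size=8, input_len=32, output_len=128), so the comparison is apples to apples.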

@WoosukKwon WoosukKwon requested a review from zhuohan123 June 11, 2023 02:15
@WoosukKwon WoosukKwon changed the title Short-circuit ColumnParallelLinear to reduce CPU overhead Remove redundant code in ColumnParallelLinear Jun 11, 2023
zhuohan123 (Member) left a comment

LGTM! This does not make any sense to me... This function is just an identity mapping in Torch. Should cost nothing.
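The "identity mapping" the review refers to is, presumably, the Megatron-style wrapper that ColumnParallelLinear applies to its input: an op that is a no-op in the forward pass and only exists to insert a gradient all-reduce in the backward pass, which is irrelevant for inference. The PR diff is not shown on this page, so the following is only a minimal sketch of the pattern (all names here are hypothetical, not copied from the diff). Even a pure identity call adds per-forward Python and kernel-launch overhead, which is why removing it helps CPU-bound serving:

```python
# Hypothetical sketch of the redundant pattern, not the actual vLLM diff.
def copy_to_model_parallel_region(x):
    # Forward: identity. In training, the matching backward would
    # all-reduce gradients across tensor-parallel ranks; during
    # inference this wrapper does nothing useful.
    return x

def forward_before(x, linear):
    # Extra Python-level call on every forward pass -> CPU overhead.
    x = copy_to_model_parallel_region(x)
    return linear(x)

def forward_after(x, linear):
    # Inference-only path: the identity wrapper is removed outright.
    return linear(x)
```

Both paths compute the same result; the "after" version simply skips one call per layer per step, which adds up across many layers and decoding iterations.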

@WoosukKwon WoosukKwon merged commit da5ddcd into main Jun 11, 2023
@WoosukKwon WoosukKwon deleted the opt branch June 11, 2023 04:25
@WoosukKwon WoosukKwon mentioned this pull request Aug 2, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024