
Commit 386817b

whx-sjtu and hw_whx authored
[Model Runner][Performance] Cache the is_encoder_decoder check result to decrease framework overhead (vllm-project#138)
In the Model Runner, is_encoder_decoder is extracted from model_config to determine whether vLLM is running an encoder-decoder model. Obtaining this status requires a long call stack, so the CPU overhead is high. This PR therefore caches the status in the __init__ of ModelInputForNPUBuilder. Signed-off-by: hw_whx <wanghexiang7@huawei.com> Co-authored-by: hw_whx <wanghexiang7@huawei.com>
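The fix is plain attribute caching: read the property once in __init__, then branch on the stored instance attribute in the per-sequence hot paths. Below is a minimal runnable sketch of the pattern; the classes are illustrative stand-ins, not vLLM's actual ModelConfig or ModelInputForNPUBuilder.

from types import SimpleNamespace

class ModelConfig:
    """Stand-in for vLLM's ModelConfig: is_encoder_decoder is a property,
    so every read re-runs the lookup below (in vLLM, a longer call stack)."""

    def __init__(self, hf_config):
        self.hf_config = hf_config

    @property
    def is_encoder_decoder(self) -> bool:
        return getattr(self.hf_config, "is_encoder_decoder", False)

class ModelInputBuilder:
    """Stand-in for ModelInputForNPUBuilder."""

    def __init__(self, runner):
        self.runner = runner
        # Read the property once; hot paths reuse the plain attribute.
        self.is_encoder_decoder = runner.model_config.is_encoder_decoder

    def add_seq_group(self, encoder_len: int) -> int:
        # Cheap attribute access instead of a chained property lookup.
        return encoder_len if self.is_encoder_decoder else 0

runner = SimpleNamespace(
    model_config=ModelConfig(SimpleNamespace(is_encoder_decoder=True)))
builder = ModelInputBuilder(runner)
assert builder.add_seq_group(8) == 8

Caching is safe here because the model type is fixed for the lifetime of the builder, so the stored value cannot go stale.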
1 parent d21b3be · commit 386817b

File tree

1 file changed: +3 -2 lines changed


vllm_ascend/model_runner.py

Lines changed: 3 additions & 2 deletions
@@ -353,6 +353,7 @@ def __init__(self,
         self.multi_modal_input_mapper = self.runner.multi_modal_input_mapper
         self.finished_requests_ids = finished_requests_ids
         self.decode_only = True
+        self.is_encoder_decoder = self.runner.model_config.is_encoder_decoder
 
         # Attention metadata inputs.
         self.attn_metadata_builder = self.attn_backend.make_metadata_builder(
@@ -423,7 +424,7 @@ def add_seq_group(self, seq_group_metadata: SequenceGroupMetadata):
 
         encoder_seq_len = 0
 
-        if self.runner.model_config.is_encoder_decoder:
+        if self.is_encoder_decoder:
             encoder_seq_len = seq_group_metadata.encoder_seq_data.get_len()
 
         inter_data = self.init_cached_inter_data(
@@ -560,7 +561,7 @@ def _compute_lens(self, inter_data: InterDataForSeqGroup, seq_idx: int,
             context_len = seq_data.get_num_computed_tokens()
             seq_len = min(seq_len, context_len + token_chunk_size)
         elif self.runner.scheduler_config.is_multi_step or \
-                self.runner.model_config.is_encoder_decoder:
+                self.is_encoder_decoder:
             context_len = seq_len - 1
         else:
             context_len = seq_data.get_num_computed_tokens()
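For a rough sense of the saving, here is a micro-benchmark sketch comparing the pre-patch chained property lookup against the post-patch cached attribute read. The Config class below is a stand-in; vLLM's real call stack behind is_encoder_decoder is deeper, so the gap measured here likely understates the production overhead.

import timeit
from types import SimpleNamespace

class Config:
    """is_encoder_decoder as a property, mimicking the pre-patch lookup."""

    def __init__(self):
        self.hf_config = SimpleNamespace(is_encoder_decoder=False)

    @property
    def is_encoder_decoder(self):
        return getattr(self.hf_config, "is_encoder_decoder", False)

runner = SimpleNamespace(model_config=Config())
builder = SimpleNamespace(
    runner=runner,
    # The patched __init__ stores the value once, as in the diff above.
    is_encoder_decoder=runner.model_config.is_encoder_decoder,
)

t_before = timeit.timeit(
    lambda: builder.runner.model_config.is_encoder_decoder, number=1_000_000)
t_after = timeit.timeit(
    lambda: builder.is_encoder_decoder, number=1_000_000)
print(f"property chain: {t_before:.3f}s, cached attribute: {t_after:.3f}s")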
