
[Question]: MInference Pre filling is slower than the vllm original version #18

@junior-zsy

Description


Describe the issue

Code:

# Copyright (c) 2024 Microsoft
# Licensed under The MIT License [see LICENSE for details]

from vllm import LLM, SamplingParams

from minference import MInference
import logging
import time

def read_content_from_file(file_path, num_chars=5000):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read(num_chars)
        return content
    except FileNotFoundError:
        logging.error(f"File {file_path} not found.")
        return ""
    except Exception as e:
        logging.error(f"An error occurred while reading the file: {e}")
        return ""

# The Chinese suffix asks the model to summarize the story above.
content = read_content_from_file("./question.txt", 12000) + ",请总结上面的故事梗概。"

prompts = [content] * 50

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1,
)
model_name = "/xxx/model/Qwen2-7B-Instruct"
llm = LLM(
    model_name,
    max_num_seqs=1,
    enforce_eager=True,
    tensor_parallel_size=1,
    max_model_len=128000,
)



start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"vllm Generating text took {elapsed_time:.2f} seconds.")



# Patch MInference Module
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"minference Generating text took {elapsed_time:.2f} seconds.")
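One measurement caveat with the script above: the MInference run executes second in the same process, so one-time costs (module patching, any kernel compilation or cache warm-up) are folded into its timing, and neither configuration gets a warm-up pass. A small stdlib-only helper (hypothetical, not part of vLLM or MInference) that warms up and averages repeated runs would make the comparison fairer:

```python
import time

def benchmark(fn, *args, warmup=1, iters=3):
    """Call fn(*args) `warmup` times untimed, then `iters` times timed;
    return mean wall-clock seconds per timed call."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters
```

For example, `benchmark(lambda: llm.generate(prompts, sampling_params))` would report a per-run average after one untimed warm-up.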

Results of execution:

Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:41<00:00, 1.22it/s]
vllm Generating text took 41.57 seconds.
Patched model for minference with vLLM..
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:34<00:00, 1.90s/it]
minference Generating text took 95.37 seconds.

Why is MInference pre-filling slower than the original vLLM here?
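MInference's speedups are reported for very long contexts (on the order of 100K tokens and up); at shorter lengths the cost of estimating each head's sparse attention pattern can exceed the attention it saves. The prompt here is built from 12,000 characters, which for Chinese text likely tokenizes to only around 10K tokens, well below that regime. A quick way to check the actual prefill length (a sketch; assumes the `transformers` tokenizer for the same checkpoint):

```python
def prompt_token_count(tokenizer, text):
    """Number of tokens the model actually prefills for `text`."""
    return len(tokenizer.encode(text))

if __name__ == "__main__":
    # Assumes `transformers` is installed; reuses the same local checkpoint
    # path as the vLLM script above.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("/xxx/model/Qwen2-7B-Instruct")
    with open("./question.txt", "r", encoding="utf-8") as f:
        text = f.read(12000) + ",请总结上面的故事梗概。"
    print(prompt_token_count(tokenizer, text))
```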

Metadata

Labels: question (Further information is requested)
