
[Question]: MInference Pre filling is slower than the vllm original version #18

@junior-zsy

Description


Describe the issue

Code:

# Copyright (c) 2024 Microsoft
# Licensed under The MIT License [see LICENSE for details]

from vllm import LLM, SamplingParams

from minference import MInference
import logging
import time

def read_content_from_file(file_path, num_chars=5000):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read(num_chars)
        return content
    except FileNotFoundError:
        logging.error(f"File {file_path} not found.")
        return ""
    except Exception as e:
        logging.error(f"An error occurred while reading the file: {e}")
        return ""

# The Chinese suffix asks the model to summarize the story above.
content = read_content_from_file("./question.txt", 12000) + ",请总结上面的故事梗概。"

prompts = [content] * 50

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1,
)
model_name = "/xxx/model/Qwen2-7B-Instruct"
llm = LLM(
    model_name,
    max_num_seqs=1,
    enforce_eager=True,
    tensor_parallel_size=1,
    max_model_len=128000,
)



start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"vllm Generating text took {elapsed_time:.2f} seconds.")



# Patch MInference Module
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"minference Generating text took {elapsed_time:.2f} seconds.")
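One measurement caveat with the script above: the MInference run executes second in the same process, so one-time costs (module patching, any kernel compilation or cache warm-up) are folded into its timing, and neither configuration gets a warm-up pass. A small stdlib-only helper (hypothetical, not part of vLLM or MInference) that warms up and averages repeated runs would make the comparison fairer:

```python
import time

def benchmark(fn, *args, warmup=1, iters=3):
    """Call fn(*args) `warmup` times untimed, then `iters` times timed;
    return mean wall-clock seconds per timed call."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters
```

For example, `benchmark(lambda: llm.generate(prompts, sampling_params))` would report a per-run average after one untimed warm-up.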

Results of execution:

Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:41<00:00, 1.22it/s]
vllm Generating text took 41.57 seconds.
Patched model for minference with vLLM..
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:34<00:00, 1.90s/it]
minference Generating text took 95.37 seconds.

Why is MInference pre-filling slower than the original vLLM here?
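MInference's speedups are reported for very long contexts (on the order of 100K tokens and up); at shorter lengths the cost of estimating each head's sparse attention pattern can exceed the attention it saves. The prompt here is built from 12,000 characters, which for Chinese text likely tokenizes to only around 10K tokens, well below that regime. A quick way to check the actual prefill length (a sketch; assumes the `transformers` tokenizer for the same checkpoint):

```python
def prompt_token_count(tokenizer, text):
    """Number of tokens the model actually prefills for `text`."""
    return len(tokenizer.encode(text))

if __name__ == "__main__":
    # Assumes `transformers` is installed; reuses the same local checkpoint
    # path as the vLLM script above.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("/xxx/model/Qwen2-7B-Instruct")
    with open("./question.txt", "r", encoding="utf-8") as f:
        text = f.read(12000) + ",请总结上面的故事梗概。"
    print(prompt_token_count(tokenizer, text))
```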

Metadata

Labels: question (Further information is requested)
