Possibility of Passing Prompts as List[str] to AsyncEngine.generate() #279

Closed
@FerdinandZhong

Description

Hi! Thank you for your amazing framework! I have tried serving a GPT BigCode model using vLLM together with Ray, following this example: https://github.com/ray-project/ray/blob/3d3183d944424a960a2c6ce048abd1316c901c1e/doc/source/serve/doc_code/vllm_example.py. In my use case the responses are non-streaming, and I pass requests directly to the vLLM async engine to take advantage of continuous batching. However, when I ran a stress-testing tool against it, the improvement in latency and throughput was not as large as expected.

One likely reason is that the test prompts are quite long on average (around 1000 tokens), which consumes nearly all of the KV cache space on the GPU, and some of the prompts are duplicates. I may therefore need to preprocess requests at the batch level first, filtering out duplicate requests before passing them to the vLLM engine. Currently, with the async engine I can only add one prompt to the pool at a time. Do you have plans to add a function that allows adding multiple prompts to the pool at once (each with a different request_id) so that results can be retrieved by request_id?
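For clarity, here is a minimal sketch of the pattern I mean, assuming the current AsyncLLMEngine.generate(prompt, sampling_params, request_id) async-generator interface; the model name, sampling parameters, and deduplication step are placeholders for illustration, not a proposal for the final API:

```python
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Placeholder model name; substitute whichever GPT BigCode checkpoint is served.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="bigcode/starcoderbase"))


async def generate_one(prompt: str, params: SamplingParams) -> str:
    request_id = str(uuid.uuid4())  # one unique request_id per prompt
    final_output = None
    # generate() is an async generator yielding incremental RequestOutputs;
    # for non-streaming use we only keep the last (finished) one.
    async for output in engine.generate(prompt, params, request_id):
        final_output = output
    return final_output.outputs[0].text


async def generate_batch(prompts: list[str]) -> list[str]:
    params = SamplingParams(max_tokens=256)  # placeholder sampling parameters
    # Deduplicate at the batch level, then fan out one engine request per
    # unique prompt; the engine batches them internally (continuous batching).
    unique = list(dict.fromkeys(prompts))
    texts = await asyncio.gather(*(generate_one(p, params) for p in unique))
    by_prompt = dict(zip(unique, texts))
    return [by_prompt[p] for p in prompts]


# Example (hypothetical prompts):
# results = asyncio.run(generate_batch(["prompt A", "prompt B", "prompt A"]))
```

An entry point that accepted the whole List[str] in one call (each prompt with its own request_id) and returned outputs keyed by request_id would make this manual fan-out unnecessary.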
