Description
Hi! Thank you for your amazing framework! I have tried serving a GPT BigCode model with vLLM and Ray, following this example: https://github.com/ray-project/ray/blob/3d3183d944424a960a2c6ce048abd1316c901c1e/doc/source/serve/doc_code/vllm_example.py In my use case the response is non-streaming, so I pass each request directly to the vLLM async engine to take advantage of continuous batching. However, when I ran a stress-testing tool against the service, the improvement in latency and throughput was smaller than I expected.
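For context, this is roughly the single-request path I am using (a minimal sketch based on the linked example; the model name is a placeholder and the exact API names may differ between vLLM versions):

```python
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Placeholder model; in my case it is a GPT BigCode model.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="bigcode/starcoder"))

async def generate_one(prompt: str) -> str:
    request_id = str(uuid.uuid4())
    params = SamplingParams(max_tokens=256)
    final_output = None
    # generate() yields incremental RequestOutputs; for a non-streaming
    # response I only keep the last one.
    async for output in engine.generate(prompt, params, request_id):
        final_output = output
    return final_output.outputs[0].text
```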
One reason might be that my test prompts are quite long on average (around 1000 tokens), so they use up almost all of the GPU KV cache, and some of the requests are duplicates. I would therefore like to preprocess each batch of requests first, filtering out duplicates before passing them to the vLLM engine. Currently, with the async engine I can only add one prompt to the pool at a time. Do you have plans to add a function that allows adding multiple prompts to the pool at once (each with its own request_id) and retrieving the results by request_id? A rough sketch of what I mean is below.
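Right now I can only approximate this on top of the existing single-prompt `generate()`; `add_prompts_batch` below is a hypothetical helper (not an existing vLLM API) showing the batch-level dedup plus retrieval-by-request_id behavior I have in mind:

```python
import asyncio
import uuid

async def add_prompts_batch(engine, prompts, sampling_params):
    """Hypothetical helper: dedupe a batch of prompts, submit each unique
    prompt once through the existing single-prompt generate(), and return
    results keyed by the caller's request_id."""
    request_ids = [str(uuid.uuid4()) for _ in prompts]

    # Map each unique prompt to all request_ids that asked for it.
    unique: dict[str, list[str]] = {}
    for rid, prompt in zip(request_ids, prompts):
        unique.setdefault(prompt, []).append(rid)

    async def run(prompt):
        final = None
        async for out in engine.generate(prompt, sampling_params, str(uuid.uuid4())):
            final = out
        return prompt, final

    results = await asyncio.gather(*(run(p) for p in unique))

    # Fan the deduplicated outputs back out to the original request_ids.
    by_request_id = {}
    for prompt, final in results:
        for rid in unique[prompt]:
            by_request_id[rid] = final.outputs[0].text
    return by_request_id
```

Having something like this supported natively by the async engine (adding multiple prompts in one call and retrieving each output by its request_id) would make batch-level preprocessing much easier.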