Proposal to improve performance
Hi team,
I've been conducting performance tests on vLLM PD disaggregation using mooncake_store_connector, and found that the most time-consuming parts are not the actual put() operations, but rather the tensor hashing (tensorhash()) and serialization (safetensor_save()) steps.
Based on profiling traces, these two steps dominate the runtime during PD disaggregation, more than the actual storage or network transmission.
Observations:
tensorhash() seems to recompute SHA-256 hashes over possibly large tensors on every call.
safetensor_save() is invoked once per tensor and serializes each one individually, which is expensive when called frequently (a simplified sketch of this pattern follows below).
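For context, here is a minimal sketch of the per-tensor pattern described above. This is not the actual mooncake_store_connector code: the tensor_hash helper, the chunks dict, and the in-memory store are stand-ins for illustration only.

```python
import hashlib

import torch
from safetensors.torch import save as safetensor_save

def tensor_hash(t: torch.Tensor) -> str:
    # Full SHA-256 over the raw tensor bytes; the cost scales with tensor
    # size and is paid again every time the same tensor is hashed.
    return hashlib.sha256(t.contiguous().cpu().numpy().tobytes()).hexdigest()

# Stand-ins for the KV-cache chunks and the Mooncake store.
chunks = {f"layer_{i}": torch.randn(2, 128, 64) for i in range(4)}
store: dict[str, bytes] = {}

for name, t in chunks.items():
    key = tensor_hash(t)                  # hotspot 1: hashing
    payload = safetensor_save({name: t})  # hotspot 2: per-tensor serialization
    store[key] = payload                  # the put() itself is comparatively cheap
```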
Questions:
Could we parallelize the hash computation using multithreading? (A rough sketch is below.)
Are there any alternatives to safetensor_save()? (A sketch of one option is below.)
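To make question 1 concrete, here is a rough sketch of hashing tensors with a thread pool. CPython's hashlib releases the GIL while hashing large buffers, so threads can overlap SHA-256 work on large tensors. The sha256_of_tensor helper and the dummy tensor list are illustrative only, not anything from the connector.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

import torch

def sha256_of_tensor(t: torch.Tensor) -> str:
    # hashlib releases the GIL while hashing large buffers, so several
    # tensors can be hashed concurrently even in pure Python threads.
    return hashlib.sha256(t.contiguous().cpu().numpy().tobytes()).hexdigest()

tensors = [torch.randn(2, 128, 64) for _ in range(8)]  # dummy KV chunks

with ThreadPoolExecutor(max_workers=4) as pool:
    hashes = list(pool.map(sha256_of_tensor, tensors))
```

For question 2, one possible direction (assuming the connector's key/value layout allows it) is to amortize the serialization overhead by saving a batch of tensors in a single safetensors call instead of one call per tensor:

```python
import torch
from safetensors.torch import load, save

tensors = {f"layer_{i}": torch.randn(2, 128, 64) for i in range(8)}

# One serialization call for the whole batch instead of one per tensor;
# the fixed per-call overhead is amortized across all tensors in the payload.
payload: bytes = save(tensors)

restored = load(payload)  # name -> tensor mapping, recovered on the decode side
```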
Thanks!
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`