Opened on Sep 5, 2024
Motivation
Currently, it's hard to quickly attribute performance issues to a particular part of our I/O path (compute->safekeeper->pageserver).
We have a lot of metrics in the safekeeper and pageserver, but relatively few in the compute. The compute is closest to the user, so it can give us a clearer picture of the performance the user is actually experiencing, and it lets us measure end-to-end performance including network latency to the compute.
DoD
- When we encounter a performance limit on the write or read path, we are able to say with confidence whether the bottleneck is on the compute or storage side
- When we see apparently slow getpage requests, we can distinguish slowness inside the server from slowness on the end-to-end path including network latency (by comparing server-side and client-side latencies)
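The second DoD item can be sketched as a simple attribution step: if the compute records the client-observed latency of a getpage request and the server reports the time it spent processing that request, the difference bounds the network-and-queueing overhead. This is a minimal illustrative sketch, not the actual metrics implementation; the function name and the threshold used to label the bottleneck are assumptions.

```python
def attribute_latency(client_seconds: float, server_seconds: float) -> dict:
    """Split a client-observed request latency into the server-reported
    processing time and the remainder (network transit plus queueing).
    Hypothetical helper for illustration only."""
    # Clamp at zero: clock skew or measurement noise can make the
    # server-reported time exceed the client-observed time.
    overhead = max(client_seconds - server_seconds, 0.0)
    return {
        "client_s": client_seconds,
        "server_s": server_seconds,
        "network_and_queueing_s": overhead,
        # Assumed heuristic: if most of the time is spent outside the
        # server, the bottleneck is likely not the storage side.
        "bottleneck": "network/compute" if overhead > server_seconds else "storage",
    }

# Example: 10 ms observed by the client, 2 ms reported by the server
# leaves 8 ms unaccounted for, pointing away from the storage side.
print(attribute_latency(0.010, 0.002))
```

In practice both sides would export these as histogram metrics rather than single samples, but the comparison logic is the same.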
Implementation ideas
Tasks
Other related tasks and Epics