Opened on Sep 5, 2024
Motivation
Currently, it's hard to quickly attribute performance issues to a particular part of our I/O path (compute->safekeeper->pageserver).
We have a lot of metrics in the safekeeper and pageserver, but relatively few in the compute. The compute is closest to the user, so it can give us a clearer picture of the performance the user is actually experiencing, and it lets us measure end-to-end performance including network latency to the compute.
DoD
- When we encounter a performance limit on the write or read path, we are able to say with confidence whether the bottleneck is on the compute or storage side
- When we see apparently slow getpage requests, we can distinguish slowness inside the server from slowness on the end-to-end path including network latency (by comparing server-side and client-side latencies)
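The second DoD item can be sketched as a simple attribution step: if the compute records the client-observed latency of a getpage request and the server reports the time it spent processing that request, the difference bounds the network-and-queueing overhead. This is a minimal illustrative sketch, not the actual metrics implementation; the function name and the threshold used to label the bottleneck are assumptions.

```python
def attribute_latency(client_seconds: float, server_seconds: float) -> dict:
    """Split a client-observed request latency into the server-reported
    processing time and the remainder (network transit plus queueing).
    Hypothetical helper for illustration only."""
    # Clamp at zero: clock skew or measurement noise can make the
    # server-reported time exceed the client-observed time.
    overhead = max(client_seconds - server_seconds, 0.0)
    return {
        "client_s": client_seconds,
        "server_s": server_seconds,
        "network_and_queueing_s": overhead,
        # Assumed heuristic: if most of the time is spent outside the
        # server, the bottleneck is likely not the storage side.
        "bottleneck": "network/compute" if overhead > server_seconds else "storage",
    }

# Example: 10 ms observed by the client, 2 ms reported by the server
# leaves 8 ms unaccounted for, pointing away from the storage side.
print(attribute_latency(0.010, 0.002))
```

In practice both sides would export these as histogram metrics rather than single samples, but the comparison logic is the same.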
Implementation ideas
Tasks
Other related tasks and Epics