Description
Problem
We need metrics to understand the health of the dstack system, including the dstack server, the database, and the infrastructure that jobs run on. The current dstack metrics focus on usage rather than system health.
Solution
As a starting point, I propose adding at least basic HTTP server metrics, such as request/response latency and error rate broken down by dstack operation.
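To make the idea concrete, here is a rough, stdlib-only sketch of tracking latency and error rate per operation. The `OperationMetrics` class, its `track` method, and the `"runs/apply"` label are hypothetical names for illustration, not existing dstack code. A real implementation would more likely plug into the server's middleware and export to a metrics backend.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OperationMetrics:
    """In-memory request count, error count, and latencies per operation (hypothetical helper)."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latencies = defaultdict(list)  # seconds, keyed by operation

    @contextmanager
    def track(self, operation):
        start = time.perf_counter()
        try:
            yield
        except Exception:
            # Count the failure, then let the caller see the original error.
            self.errors[operation] += 1
            raise
        finally:
            # Runs for both successful and failed requests.
            self.requests[operation] += 1
            self.latencies[operation].append(time.perf_counter() - start)

    def error_rate(self, operation):
        total = self.requests[operation]
        return self.errors[operation] / total if total else 0.0


metrics = OperationMetrics()

# One successful request.
with metrics.track("runs/apply"):  # "runs/apply" is an assumed operation label
    pass

# One failing request.
try:
    with metrics.track("runs/apply"):
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
```

Keying everything by operation name is what gives the granularity the load balancer metrics lack.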
Some additional metrics to consider: how long a run stays pending, the latency from when an apply request is accepted to when the job starts, and DB health metrics such as connection error rates and latencies.
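The pending-time metric could be derived from timestamps that are presumably already recorded for each run. A minimal sketch, assuming hypothetical `submitted_at` and `started_at` values (these names are illustrative and may not match dstack's actual fields):

```python
from datetime import datetime, timedelta, timezone


def pending_seconds(submitted_at, started_at=None, now=None):
    """Seconds a run has spent pending.

    If the job has not started yet, measure up to `now` (defaults to the current UTC time).
    """
    end = started_at or now or datetime.now(timezone.utc)
    # Clamp at zero to guard against clock skew between components.
    return max((end - submitted_at).total_seconds(), 0.0)


# Hypothetical timestamps for illustration.
submitted = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
started = submitted + timedelta(seconds=42)
```

The same calculation, measured from the moment the apply request is accepted, would give the apply-to-job-start latency.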
Workaround
We can capture some HTTP metrics from our load balancer, but they're not granular enough. It's useful to see, for example, which dstack operations are failing or taking a long time to execute. The other metrics need to come from dstack itself, so there is no workaround for them.
Would you like to help us implement this feature by sending a PR?
Yes