[Feature]: Health metrics

### Problem

We need metrics that show the health of the dstack system, including the dstack server, the database, and the infrastructure the jobs run on. The current dstack metrics focus on usage rather than system health.

### Solution

As a starting point, I'm proposing we add at least the basic HTTP server metrics, such as request/response latency and error rate, broken down by dstack operation.
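To make the idea concrete, here is a minimal sketch of per-operation latency and error tracking. It uses only the Python standard library and keeps everything in memory; a real implementation would more likely export these values through a metrics library. The `OperationMetrics` class, its method names, and the `runs/apply` operation label are all hypothetical and are not part of dstack.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OperationMetrics:
    """In-memory request counts, error counts, and latency totals per operation."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latency_seconds = defaultdict(float)

    @contextmanager
    def track(self, operation):
        """Time the wrapped block and count it as one request for `operation`."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            # Count the failure, then let the caller handle the exception.
            self.errors[operation] += 1
            raise
        finally:
            self.requests[operation] += 1
            self.latency_seconds[operation] += time.perf_counter() - start

    def error_rate(self, operation):
        total = self.requests[operation]
        return self.errors[operation] / total if total else 0.0

    def mean_latency(self, operation):
        total = self.requests[operation]
        return self.latency_seconds[operation] / total if total else 0.0


# Usage: one successful and one failing call to a hypothetical operation.
metrics = OperationMetrics()
with metrics.track("runs/apply"):
    pass
try:
    with metrics.track("runs/apply"):
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(metrics.error_rate("runs/apply"))  # 0.5
```

Keying every counter by operation name is what gives the granularity the load balancer metrics lack: each operation gets its own error rate and mean latency.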

Some additional metrics to consider: the amount of time a run spends pending, the latency from when an apply request is accepted to when the job starts, and DB health metrics such as connection error rate and query latency.
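As a rough illustration of the pending-time metric, the sketch below derives the duration from two timestamps and groups the results into cumulative histogram buckets. The `RunTimestamps` fields, the helper names, and the bucket bounds are assumptions made for the example; the actual dstack models may store this differently. It does not cover the DB health metrics.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RunTimestamps:
    # Hypothetical fields; the real dstack models may name these differently.
    submitted_at: datetime
    started_at: Optional[datetime] = None


def pending_seconds(run, now):
    """Seconds between submission and job start, or until `now` if still pending."""
    end = run.started_at if run.started_at is not None else now
    return (end - run.submitted_at).total_seconds()


def bucket_counts(durations, upper_bounds):
    """Cumulative histogram: number of durations at or below each upper bound."""
    return {bound: sum(1 for d in durations if d <= bound) for bound in upper_bounds}


now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
runs = [
    # Submitted 90 s ago and started 60 s ago, so it was pending for 30 s.
    RunTimestamps(now - timedelta(seconds=90), now - timedelta(seconds=60)),
    # Submitted 400 s ago and still pending.
    RunTimestamps(now - timedelta(seconds=400)),
]
durations = [pending_seconds(r, now) for r in runs]
histogram = bucket_counts(durations, [60, 300, 600])
print(histogram)  # {60: 1, 300: 1, 600: 2}
```

Counting still-pending runs against the current time means a run that is stuck shows up in the slow buckets immediately, rather than only after it eventually starts.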

### Workaround

We're able to capture some of the HTTP metrics from our load balancer, but they're not granular enough. For example, it's useful to see which dstack operations are failing or taking a long time to execute. The other metrics need to come from dstack itself, so there is no workaround for them.

### Would you like to help us implement this feature by sending a PR?

Yes


[Feature]: Health metrics #2736
