[Feature]: Health metrics #2736

Open
@Nadine-H

Problem

We need metrics that show the health of the dstack system: the dstack server, the database, and the infrastructure the jobs run on. The current dstack metrics focus on usage rather than system health.

Solution

As a starting point, I'm proposing we add at least the basic HTTP server metrics, such as request/response latency and error rate, broken down by dstack operation.
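To make the idea concrete, here is a minimal sketch of per-operation latency and error tracking. It uses only the Python standard library and keeps everything in memory; the class name `OperationMetrics` and the operation label `"runs/apply"` are made up for illustration and are not taken from the dstack codebase. A real implementation would more likely hook into the server's middleware and export to whatever metrics backend dstack settles on.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OperationMetrics:
    """In-memory request counts, error counts, and latencies per operation.

    Hypothetical helper for illustration only; not part of dstack.
    """

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latencies = defaultdict(list)  # seconds, one entry per request

    @contextmanager
    def track(self, operation):
        """Time the wrapped block and count it, recording an error if it raises."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors[operation] += 1
            raise
        finally:
            self.requests[operation] += 1
            self.latencies[operation].append(time.perf_counter() - start)

    def error_rate(self, operation):
        """Fraction of requests for this operation that raised (0.0 if none seen)."""
        total = self.requests[operation]
        return self.errors[operation] / total if total else 0.0


# Usage: wrap the handler body for each operation (label is an assumption).
metrics = OperationMetrics()
with metrics.track("runs/apply"):
    pass  # handle the request here
```

Keying everything by operation name is what gives the per-operation breakdown that a load balancer can't provide.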

Some additional metrics to consider: how long a run stays pending, the latency from when an apply request is accepted to job start, and DB health metrics such as connection error rates and latencies.
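As a rough sketch of the two run-level timings, assuming the server records a timestamp when the apply request is accepted and another when the job starts. The `RunTimestamps` type and its field names are hypothetical and do not reflect the actual dstack data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RunTimestamps:
    # Hypothetical fields; not taken from the dstack data model.
    apply_accepted_at: datetime
    job_started_at: Optional[datetime] = None


def pending_seconds(ts: RunTimestamps, now: datetime) -> Optional[float]:
    """How long a run has been pending so far; None once the job has started."""
    if ts.job_started_at is not None:
        return None
    return (now - ts.apply_accepted_at).total_seconds()


def apply_to_start_seconds(ts: RunTimestamps) -> Optional[float]:
    """Latency from apply accepted to job start; None while still pending."""
    if ts.job_started_at is None:
        return None
    return (ts.job_started_at - ts.apply_accepted_at).total_seconds()


# Usage with made-up timestamps.
accepted = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
run = RunTimestamps(apply_accepted_at=accepted)
waiting = pending_seconds(run, now=accepted + timedelta(seconds=30))

run.job_started_at = accepted + timedelta(seconds=90)
startup = apply_to_start_seconds(run)
```

The pending time would be reported as a gauge for runs that haven't started yet, while the apply-to-start latency is a one-off observation per run and fits a histogram better.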

Workaround

We're able to capture some of the HTTP metrics from our load balancer, but they're not granular enough. It's useful to see, for example, which dstack operations are failing or taking a long time to execute. The other metrics need to come from dstack, so there is no workaround for them.

Would you like to help us implement this feature by sending a PR?

Yes
