@can-anyscale can-anyscale commented Sep 3, 2025

This PR is part of a series that unifies the metric definition infrastructure.

This PR migrates all GCS metrics to use the metric interface. It does so by creating the metric objects inside gcs_server and passing them down as interfaces to sub-components.
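
A minimal sketch of the injection pattern described above, assuming a simplified `MetricInterface` with a single `Record` method. `SimpleGauge`, `JobManager`, `Server`, and the `State` tag are hypothetical stand-ins chosen for illustration; they are not Ray's actual declarations or signatures.

```cpp
#include <cassert>
#include <map>
#include <string>

// All names below are hypothetical stand-ins, not Ray's actual declarations.
using Tags = std::map<std::string, std::string>;

// The narrow interface that sub-components depend on.
class MetricInterface {
 public:
  virtual ~MetricInterface() = default;
  virtual void Record(double value, const Tags &tags) = 0;
};

// A concrete gauge that keeps the last value recorded for each tag set.
class SimpleGauge : public MetricInterface {
 public:
  void Record(double value, const Tags &tags) override { values_[tags] = value; }

  double Get(const Tags &tags) const {
    auto it = values_.find(tags);
    return it == values_.end() ? 0.0 : it->second;
  }

 private:
  std::map<Tags, double> values_;
};

// A sub-component that sees only the interface, never the concrete metric.
class JobManager {
 public:
  explicit JobManager(MetricInterface &running_jobs_gauge)
      : running_jobs_gauge_(running_jobs_gauge) {}

  void OnJobStarted() { Publish(++running_jobs_); }
  void OnJobFinished() { Publish(--running_jobs_); }

 private:
  void Publish(int count) {
    running_jobs_gauge_.Record(count, {{"State", "RUNNING"}});
  }

  MetricInterface &running_jobs_gauge_;
  int running_jobs_ = 0;
};

// The server owns the metric object and passes it down by reference.
class Server {
 public:
  Server() : job_manager_(running_jobs_gauge_) {}

  JobManager &job_manager() { return job_manager_; }
  const SimpleGauge &running_jobs_gauge() const { return running_jobs_gauge_; }

 private:
  SimpleGauge running_jobs_gauge_;  // Declared first so it is constructed first.
  JobManager job_manager_;
};

// Usage: two jobs start and one finishes, so the gauge reads 1.
inline void DemoInjection() {
  Server server;
  server.job_manager().OnJobStarted();
  server.job_manager().OnJobStarted();
  server.job_manager().OnJobFinished();
  assert(server.running_jobs_gauge().Get({{"State", "RUNNING"}}) == 1.0);
}
```

Because the sub-component holds only a reference to the interface, a test can hand it a fake metric without touching any global metric registry.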

This is purely a refactoring with repetitive patterns, so it is easier to review than the number of files changed suggests.

Test:

  • CI

Note

Refactors GCS and core worker to use injected MetricInterface objects for all metrics, adding new metric helpers and rewiring constructors, server startup, storage client, and tests accordingly.

  • Metrics Infrastructure:
    • Introduce metric helpers in src/ray/common/metrics.h and src/ray/gcs/metrics.h (gauges/histograms/counters for actors, jobs, placement groups, task events, and GCS storage).
    • Replace direct stats usage with MetricInterface across GCS and core worker; rename helpers (e.g., GetTaskMetric -> GetTaskByStateGaugeMetric, GetRayEventRecorderDroppedEventsMetric -> GetRayEventRecorderDroppedEventsCounterMetric).
  • GCS Server Refactor:
    • GcsServer now constructs/accepts metric instances and passes them to subcomponents via Start/DoStart and init methods.
    • GcsActorManager, GcsJobManager, GcsPlacementGroupManager, and GcsTaskManager constructors updated to receive and record via MetricInterface.
    • ObservableStoreClient wraps delegate and records storage metrics via injected interfaces.
  • Core Worker:
    • TaskCounter and CoreWorker updated to use task/actor state gauges via injected MetricInterface.
  • Tests/Mocks/Build:
    • Update mocks and tests to use FakeGauge/FakeCounter/FakeHistogram; validate metric tags/values.
    • Add Bazel targets/deps for new metric headers and fakes; minor BUILD wiring adjustments.
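
As a rough illustration of the last two groups of bullets (a store client that wraps a delegate, and tests that validate metric tags and values through fakes), the sketch below uses simplified stand-ins. `FakeCounter` here only borrows the name from the summary; its `Total` accessor, the `StoreClient` interface, `CountingStoreClient`, and the `Operation` tag are assumptions made for this example, not Ray's real API.

```cpp
#include <cassert>
#include <map>
#include <string>

// All names below are simplified, hypothetical stand-ins for illustration.
using Tags = std::map<std::string, std::string>;

class MetricInterface {
 public:
  virtual ~MetricInterface() = default;
  virtual void Record(double value, const Tags &tags) = 0;
};

// Test double: sums recorded values per tag set so a test can inspect them.
class FakeCounter : public MetricInterface {
 public:
  void Record(double value, const Tags &tags) override { totals_[tags] += value; }

  double Total(const Tags &tags) const {
    auto it = totals_.find(tags);
    return it == totals_.end() ? 0.0 : it->second;
  }

 private:
  std::map<Tags, double> totals_;
};

class StoreClient {
 public:
  virtual ~StoreClient() = default;
  virtual void Put(const std::string &key, const std::string &value) = 0;
  // Returns true and fills *value if the key exists.
  virtual bool Get(const std::string &key, std::string *value) = 0;
};

class InMemoryStoreClient : public StoreClient {
 public:
  void Put(const std::string &key, const std::string &value) override {
    data_[key] = value;
  }
  bool Get(const std::string &key, std::string *value) override {
    auto it = data_.find(key);
    if (it == data_.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  std::map<std::string, std::string> data_;
};

// Decorator: counts every call by operation, then forwards it to the delegate.
class CountingStoreClient : public StoreClient {
 public:
  CountingStoreClient(StoreClient &delegate, MetricInterface &operation_counter)
      : delegate_(delegate), operation_counter_(operation_counter) {}

  void Put(const std::string &key, const std::string &value) override {
    operation_counter_.Record(1, {{"Operation", "Put"}});
    delegate_.Put(key, value);
  }
  bool Get(const std::string &key, std::string *value) override {
    operation_counter_.Record(1, {{"Operation", "Get"}});
    return delegate_.Get(key, value);
  }

 private:
  StoreClient &delegate_;
  MetricInterface &operation_counter_;
};

// Test-style usage: inject the fake, exercise the wrapper, check tags/values.
inline void DemoFakeCounter() {
  InMemoryStoreClient backing;
  FakeCounter counter;
  CountingStoreClient store(backing, counter);

  std::string value;
  store.Put("job:1", "RUNNING");
  const bool found = store.Get("job:1", &value);
  const bool missing_found = store.Get("job:2", &value);
  assert(found && value == "RUNNING");
  assert(!missing_found);

  assert(counter.Total({{"Operation", "Put"}}) == 1.0);
  assert(counter.Total({{"Operation", "Get"}}) == 2.0);
}
```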

Written by Cursor Bugbot for commit bd5ff5a. This will update automatically on new commits. Configure here.

@can-anyscale can-anyscale force-pushed the can-metric01 branch 4 times, most recently from aa8a215 to 85e420f Compare September 5, 2025 21:53
Base automatically changed from can-metric01 to master September 8, 2025 17:37
@can-anyscale can-anyscale force-pushed the can-metric02 branch 2 times, most recently from 587ed3b to bf0b3ff Compare September 11, 2025 19:07
@can-anyscale can-anyscale changed the title from "[core][metric] Redefine more STATS using metric interface" to "[core][metric] Redefine gcs STATS using metric interface" Sep 11, 2025
@can-anyscale can-anyscale marked this pull request as ready for review September 11, 2025 19:08
@can-anyscale can-anyscale requested a review from a team as a code owner September 11, 2025 19:08
@can-anyscale can-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Sep 11, 2025
@can-anyscale can-anyscale marked this pull request as draft September 11, 2025 23:11
@can-anyscale can-anyscale force-pushed the can-metric02 branch 7 times, most recently from ef046b6 to b22516d Compare September 23, 2025 21:56
@can-anyscale can-anyscale force-pushed the can-metric02 branch 3 times, most recently from e53cd65 to be370f9 Compare September 24, 2025 16:19
inline ray::stats::Gauge GetActorMetric() {
/// Tracks actors by state, including pending, running, and idle actors.
///
/// To avoid metric collection conflicts between components reporting on the same actor,

@ZacAttack ZacAttack Sep 24, 2025


I'm a little confused by this. Right now we have component labels for things like raylet and gcs, and we have a Name label for the metric name. Now we're adding a source label. Why would two components necessarily conflict in the current setup? Are they within the same component? I'm unclear why we need an additional label now.

Contributor Author


Good point about Source vs. Component. I’m also not sure of the original purpose of each; these tags predate both my time and this PR (which is purely a refactoring and doesn’t add the Source tag). I’m open to revisiting or merging the two tags in a follow-up PR.
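
For context on the question in this thread, here is one generic way a reporting conflict can arise with a last-write-wins gauge, and how an extra tag separates the series. This is only a sketch under assumptions: the `Gauge` class, the `SumWhere` helper, and the tag values `reporter_a` and `reporter_b` are made up, and nothing in this thread establishes that this is the original motivation for Ray's Source tag.

```cpp
#include <cassert>
#include <map>
#include <string>

using Tags = std::map<std::string, std::string>;

// Hypothetical last-write-wins gauge: one stored value per distinct tag set.
class Gauge {
 public:
  void Record(double value, const Tags &tags) { series_[tags] = value; }

  // Sums every series whose tags contain key == value.
  double SumWhere(const std::string &key, const std::string &value) const {
    double sum = 0.0;
    for (const auto &entry : series_) {
      auto it = entry.first.find(key);
      if (it != entry.first.end() && it->second == value) sum += entry.second;
    }
    return sum;
  }

 private:
  std::map<Tags, double> series_;
};

// Two reporters share a tag set, so the second write replaces the first.
inline double SumWithSharedTags() {
  Gauge actors;
  actors.Record(3, {{"State", "ALIVE"}});  // reporter A
  actors.Record(5, {{"State", "ALIVE"}});  // reporter B overwrites A
  return actors.SumWhere("State", "ALIVE");
}

// A distinguishing tag (values here are made up) keeps the series apart.
inline double SumWithSourceTag() {
  Gauge actors;
  actors.Record(3, {{"State", "ALIVE"}, {"Source", "reporter_a"}});
  actors.Record(5, {{"State", "ALIVE"}, {"Source", "reporter_b"}});
  return actors.SumWhere("State", "ALIVE");
}

inline void DemoSourceTag() {
  assert(SumWithSharedTags() == 5.0);  // 3 was overwritten by 5
  assert(SumWithSourceTag() == 8.0);   // 3 + 5, both series survive
}
```

Whether a second tag such as Component could play the same role is exactly the open question raised above; the sketch only shows that some distinguishing tag is needed when two writers would otherwise share a tag set.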


- TaskCounter::TaskCounter(ray::observability::MetricInterface &task_by_state_counter)
-     : task_by_state_counter_(task_by_state_counter) {
+ TaskCounter::TaskCounter(ray::observability::MetricInterface &task_by_state_gauge,
Contributor


Maybe we should change the name of the type since it's now a gauge?

Contributor Author


The internal Prometheus metric has always been a gauge, while the wrapper has always been named TaskCounter. In this PR I renamed task_by_state_counter to task_by_state_gauge to correct a regression I introduced earlier; underneath, it has always been a gauge. I can see the merit in keeping Counter in the wrapper's name, since the concept of a gauge might feel unfamiliar or like an implementation detail, at least to me. I’m open to using either name for the wrapper, though elsewhere in the codebase Gauge and Counter are used interchangeably in wrapper names (as in [1]).

  auto counters = stats_counter_.GetAll();
- ray::stats::STATS_gcs_task_manager_task_events_reported.Record(
-     counters[kTotalNumTaskEventsReported]);
+ task_events_reported_gauge_.Record(counters[kTotalNumTaskEventsReported]);
Contributor Author


[1]
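
The gauge-versus-counter naming discussed in this thread comes down to two different semantics: a counter only accumulates, while a gauge is overwritten with the current value and can therefore go down. The sketch below is an assumption-laden illustration, not Ray's `TaskCounter`: `Counter`, `Gauge`, `TaskStateTracker`, and the state names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Counter semantics: the value only ever accumulates.
class Counter {
 public:
  void Increment(double by = 1.0) { total_ += by; }
  double Value() const { return total_; }

 private:
  double total_ = 0.0;
};

// Gauge semantics: each write replaces the value, so it can go up or down.
class Gauge {
 public:
  void Set(double value) { value_ = value; }
  double Value() const { return value_; }

 private:
  double value_ = 0.0;
};

// Hypothetical tracker: per-state populations need a gauge, while the total
// number of transitions fits a counter.
class TaskStateTracker {
 public:
  // Pass an empty `from` for a newly created task.
  void Transition(const std::string &from, const std::string &to) {
    if (!from.empty()) {
      by_state_[from].Set(--counts_[from]);
    }
    by_state_[to].Set(++counts_[to]);
    transitions_.Increment();
  }

  double InState(const std::string &state) const {
    auto it = by_state_.find(state);
    return it == by_state_.end() ? 0.0 : it->second.Value();
  }
  double TotalTransitions() const { return transitions_.Value(); }

 private:
  std::map<std::string, int> counts_;
  std::map<std::string, Gauge> by_state_;
  Counter transitions_;
};

inline void DemoGaugeVersusCounter() {
  TaskStateTracker tracker;
  tracker.Transition("", "PENDING");
  tracker.Transition("", "PENDING");
  tracker.Transition("PENDING", "RUNNING");
  tracker.Transition("RUNNING", "FINISHED");
  assert(tracker.InState("PENDING") == 1.0);   // went up to 2, then down to 1
  assert(tracker.InState("RUNNING") == 0.0);   // went up to 1, then back to 0
  assert(tracker.InState("FINISHED") == 1.0);
  assert(tracker.TotalTransitions() == 4.0);   // never decreases
}
```

A wrapper that "counts" tasks per state can thus reasonably sit on top of a gauge, which is consistent with the naming situation described in the reply above.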

@can-anyscale can-anyscale force-pushed the can-metric02 branch 2 times, most recently from a56c143 to cabfde9 Compare September 24, 2025 22:08
@can-anyscale can-anyscale force-pushed the can-metric02 branch 2 times, most recently from 2afc068 to 526b745 Compare October 3, 2025 22:26
Signed-off-by: Cuong Nguyen <can@anyscale.com>
@can-anyscale can-anyscale merged commit 534b0e4 into master Oct 6, 2025
6 checks passed
@can-anyscale can-anyscale deleted the can-metric02 branch October 6, 2025 18:37
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…#56201)

eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
…#56201)

eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
…#56201)

eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
…#56201)

eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
…#56201)

eicherseiji pushed a commit to eicherseiji/ray that referenced this pull request Oct 6, 2025
…#56201)

aslonnie added a commit that referenced this pull request Oct 7, 2025
liulehui pushed a commit to liulehui/ray that referenced this pull request Oct 9, 2025
…#56201)

joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
…#56201)

joshkodi pushed a commit to joshkodi/ray that referenced this pull request Oct 13, 2025
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…#56201)

landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…#56201)

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…#56201)

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025

Labels

  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling


4 participants