-
Notifications
You must be signed in to change notification settings - Fork 6.9k
[core][metric] Redefine gcs STATS using metric interface #56201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
6fdf8c3
49848e7
acd38ed
b245af5
b230390
67cd514
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| // Copyright 2025 The Ray Authors. | ||
| // | ||
| // Licensed under the Apache License, Version 2.0 (the "License"); | ||
| // you may not use this file except in compliance with the License. | ||
| // You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, software | ||
| // distributed under the License is distributed on an "AS IS" BASIS, | ||
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| // See the License for the specific language governing permissions and | ||
| // limitations under the License. | ||
|
|
||
| #pragma once | ||
|
|
||
| #include "ray/stats/metric.h" | ||
|
|
||
| namespace ray { | ||
|
|
||
| inline ray::stats::Gauge GetActorByStateGaugeMetric() { | ||
| /// Tracks actors by state, including pending, running, and idle actors. | ||
| /// | ||
| /// To avoid metric collection conflicts between components reporting on the same actor, | ||
| /// we use the "Source" required label. | ||
| return ray::stats::Gauge{ | ||
| /*name=*/"actors", | ||
| /*description=*/ | ||
| "An actor can be in one of DEPENDENCIES_UNREADY, PENDING_CREATION, ALIVE, " | ||
| "ALIVE_IDLE, ALIVE_RUNNING_TASKS, RESTARTING, or DEAD states. " | ||
| "An actor is considered ALIVE_IDLE if it is not executing any tasks.", | ||
| /*unit=*/"", | ||
| // State: the actor state, which is from rpc::ActorTableData::ActorState, | ||
| // For ALIVE actor the sub-state can be IDLE, RUNNING_TASK, | ||
| // RUNNING_IN_RAY_GET, and RUNNING_IN_RAY_WAIT. | ||
| // Name: the name of actor class (Keep in sync with the TASK_OR_ACTOR_NAME_TAG_KEY | ||
| // in python/ray/_private/telemetry/metric_cardinality.py) Source: component | ||
| // reporting, e.g., "gcs" or "executor". | ||
| /*tag_keys=*/{"State", "Name", "Source", "JobId"}, | ||
| }; | ||
| } | ||
|
|
||
| } // namespace ray | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -165,8 +165,10 @@ JobID GetProcessJobID(const CoreWorkerOptions &options) { | |
| return options.job_id; | ||
| } | ||
|
|
||
| TaskCounter::TaskCounter(ray::observability::MetricInterface &task_by_state_counter) | ||
| : task_by_state_counter_(task_by_state_counter) { | ||
| TaskCounter::TaskCounter(ray::observability::MetricInterface &task_by_state_gauge, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we should change the name of the type since it's now a gauge?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The internal Prometheus metric has always been a gauge, while the wrapper has always been TaskCounter. I fixed the naming in this PR from task_by_state_counter to task_by_state_gauge to correct a regression I introduced earlier, but underneath, it has always been a gauge. I can see the merit in naming the wrapper Counter, since the concept of a gauge might feel unfamiliar or like an implementation detail, at least to me. But I’m open to using either name for the wrapper, though elsewhere in the codebase Gauge and Counter are used interchangeably for wrapper names (like in [1]). |
||
| ray::observability::MetricInterface &actor_by_state_gauge) | ||
| : task_by_state_gauge_(task_by_state_gauge), | ||
| actor_by_state_gauge_(actor_by_state_gauge) { | ||
| counter_.SetOnChangeCallback( | ||
| [this](const std::tuple<std::string, TaskStatusType, bool> | ||
| &key) ABSL_EXCLUSIVE_LOCKS_REQUIRED(&mu_) mutable { | ||
|
|
@@ -181,31 +183,31 @@ TaskCounter::TaskCounter(ray::observability::MetricInterface &task_by_state_coun | |
| const auto is_retry_label = is_retry ? "1" : "0"; | ||
| // RUNNING_IN_RAY_GET/WAIT are sub-states of RUNNING, so we need to subtract | ||
| // them out to avoid double-counting. | ||
| task_by_state_counter_.Record( | ||
| task_by_state_gauge_.Record( | ||
| running_total - num_in_get - num_in_wait, | ||
| {{"State"sv, rpc::TaskStatus_Name(rpc::TaskStatus::RUNNING)}, | ||
| {"Name"sv, func_name}, | ||
| {"IsRetry"sv, is_retry_label}, | ||
| {"JobId"sv, job_id_}, | ||
| {"Source"sv, "executor"}}); | ||
| // Negate the metrics recorded from the submitter process for these tasks. | ||
| task_by_state_counter_.Record( | ||
| task_by_state_gauge_.Record( | ||
| -running_total, | ||
| {{"State"sv, rpc::TaskStatus_Name(rpc::TaskStatus::SUBMITTED_TO_WORKER)}, | ||
| {"Name"sv, func_name}, | ||
| {"IsRetry"sv, is_retry_label}, | ||
| {"JobId"sv, job_id_}, | ||
| {"Source"sv, "executor"}}); | ||
| // Record sub-state for get. | ||
| task_by_state_counter_.Record( | ||
| task_by_state_gauge_.Record( | ||
| num_in_get, | ||
| {{"State"sv, rpc::TaskStatus_Name(rpc::TaskStatus::RUNNING_IN_RAY_GET)}, | ||
| {"Name"sv, func_name}, | ||
| {"IsRetry"sv, is_retry_label}, | ||
| {"JobId"sv, job_id_}, | ||
| {"Source"sv, "executor"}}); | ||
| // Record sub-state for wait. | ||
| task_by_state_counter_.Record( | ||
| task_by_state_gauge_.Record( | ||
| num_in_wait, | ||
| {{"State"sv, rpc::TaskStatus_Name(rpc::TaskStatus::RUNNING_IN_RAY_WAIT)}, | ||
| {"Name"sv, func_name}, | ||
|
|
@@ -226,16 +228,16 @@ void TaskCounter::RecordMetrics() { | |
| } else { | ||
| running_tasks = 1.0; | ||
| } | ||
| ray::stats::STATS_actors.Record(idle, | ||
| {{"State", "ALIVE_IDLE"}, | ||
| {"Name", actor_name_}, | ||
| {"Source", "executor"}, | ||
| {"JobId", job_id_}}); | ||
| ray::stats::STATS_actors.Record(running_tasks, | ||
| {{"State", "ALIVE_RUNNING_TASKS"}, | ||
| {"Name", actor_name_}, | ||
| {"Source", "executor"}, | ||
| {"JobId", job_id_}}); | ||
| actor_by_state_gauge_.Record(idle, | ||
| {{"State"sv, "ALIVE_IDLE"}, | ||
| {"Name"sv, actor_name_}, | ||
| {"Source"sv, "executor"}, | ||
| {"JobId"sv, job_id_}}); | ||
| actor_by_state_gauge_.Record(running_tasks, | ||
| {{"State"sv, "ALIVE_RUNNING_TASKS"}, | ||
| {"Name"sv, actor_name_}, | ||
| {"Source"sv, "executor"}, | ||
| {"JobId"sv, job_id_}}); | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -303,7 +305,8 @@ CoreWorker::CoreWorker( | |
| instrumented_io_context &task_execution_service, | ||
| std::unique_ptr<worker::TaskEventBuffer> task_event_buffer, | ||
| uint32_t pid, | ||
| ray::observability::MetricInterface &task_by_state_counter) | ||
| ray::observability::MetricInterface &task_by_state_gauge, | ||
| ray::observability::MetricInterface &actor_by_state_gauge) | ||
| : options_(std::move(options)), | ||
| get_call_site_(RayConfig::instance().record_ref_creation_sites() | ||
| ? options_.get_lang_stack | ||
|
|
@@ -342,7 +345,7 @@ CoreWorker::CoreWorker( | |
| task_execution_service_(task_execution_service), | ||
| exiting_detail_(std::nullopt), | ||
| max_direct_call_object_size_(RayConfig::instance().max_direct_call_object_size()), | ||
| task_counter_(task_by_state_counter), | ||
| task_counter_(task_by_state_gauge, actor_by_state_gauge), | ||
| task_event_buffer_(std::move(task_event_buffer)), | ||
| pid_(pid), | ||
| actor_shutdown_callback_(options_.actor_shutdown_callback), | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by this. Right now we have component labels for things like raylet and gcs and we have a Name label for a metric name. Now we're adding a source label. Why would two components necessarily conflict in the current set up? Are they within the same component? I'm unclear why we need an additional label now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about Source vs. Component. I’m also not sure of the original purpose of each, these tags predate both my time and this PR (which is purely a refactoring and doesn’t add the Source tag). I’m open to revisiting or merging the two tags in a follow-up PR.