-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] telemetry for oom occurences #30472
Conversation
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
cc @thomasdesr for security review. |
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'll follow up with #30427.
cc @iycheng for the gcs side change.
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
…elemtrics Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Will defer to @iycheng for merge. |
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
@@ -138,7 +138,7 @@ class RedisInternalKV : public InternalKVInterface { | |||
/// This implementation class of `InternalKVHandler`. | |||
class GcsInternalKVManager : public rpc::InternalKVHandler { | |||
public: | |||
explicit GcsInternalKVManager(std::unique_ptr<InternalKVInterface> kv_instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to make it a shared ptr? If it's not shared across threads, we shouldn't do this. kv_instance is belonged to GcsInternalKVManager only which makes things simpler.
I feel we shouldn't use kv instance here directly for stats. We should use general APIs provided by GCS (GlobalStateAcessor/GcsClient). I'm thinking about Raylet/CoreWorker which want to store the stats and we probably don't want to have different approaches to record the data. There is protocol for telemetry and now we have already got python version where core logical is written in py. I'd think in the future, it's going to be very hard to maintain (the same protocol in py/java/gcs/coreworker/raylet). |
We added a TODO to do the the way you suggested (gcs client). The reason it is a TODO right now to integrate with gcs client is because it is crashing / hang when we try to use it directly - we are not sure if it is because we are calling gcs from gcs. Jiajun was looking into this last Friday. The hope is to get something functional and migrate to the gcs client approach as soon as we figured out what was going on that prevented it. |
@clarng that sgtm then! Thanks for the explanation! The only comment is the unique_ptr/shared_ptr. |
Lmk when tests pass! |
Looks like it is passing now, some failures look unrelated (not worse than master) |
We added some basic Ray core metrics: num of actors created and num of pgs created. To achieve so, this PR introduced 3 changes (potentially we can split them): Move the usage_lib.py TagKey into usage.proto, and also added @pcmoritz / @thomasdesr as code owner. This makes the future change to telemetry easier to audit; and also allows us use telemetry across different languages. Introduced a Gcs/C++ version of usage_lib (GcsUsgageReporter). Its implementation is similar to [core] telemetry for oom occurences #30472. This make it easier to add telemetry from GCS. Added PlacementGroup and Actor telemetry code which collects the number of actors/pgs created during the lifetime of the cluster.
track how often oom triggers in the wild Signed-off-by: Clarence Ng <clarence.wyng@gmail.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
We added some basic Ray core metrics: num of actors created and num of pgs created. To achieve so, this PR introduced 3 changes (potentially we can split them): Move the usage_lib.py TagKey into usage.proto, and also added @pcmoritz / @thomasdesr as code owner. This makes the future change to telemetry easier to audit; and also allows us use telemetry across different languages. Introduced a Gcs/C++ version of usage_lib (GcsUsgageReporter). Its implementation is similar to [core] telemetry for oom occurences ray-project#30472. This make it easier to add telemetry from GCS. Added PlacementGroup and Actor telemetry code which collects the number of actors/pgs created during the lifetime of the cluster. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
We added some basic Ray core metrics: num of actors created and num of pgs created. To achieve so, this PR introduced 3 changes (potentially we can split them): Move the usage_lib.py TagKey into usage.proto, and also added @pcmoritz / @thomasdesr as code owner. This makes the future change to telemetry easier to audit; and also allows us use telemetry across different languages. Introduced a Gcs/C++ version of usage_lib (GcsUsgageReporter). Its implementation is similar to [core] telemetry for oom occurences #30472. This make it easier to add telemetry from GCS. Added PlacementGroup and Actor telemetry code which collects the number of actors/pgs created during the lifetime of the cluster.
We added some basic Ray core metrics: num of actors created and num of pgs created. To achieve so, this PR introduced 3 changes (potentially we can split them): Move the usage_lib.py TagKey into usage.proto, and also added @pcmoritz / @thomasdesr as code owner. This makes the future change to telemetry easier to audit; and also allows us use telemetry across different languages. Introduced a Gcs/C++ version of usage_lib (GcsUsgageReporter). Its implementation is similar to [core] telemetry for oom occurences ray-project#30472. This make it easier to add telemetry from GCS. Added PlacementGroup and Actor telemetry code which collects the number of actors/pgs created during the lifetime of the cluster. Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Signed-off-by: Clarence Ng clarence.wyng@gmail.com
Why are these changes needed?
track how often oom triggers in the wild
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.