
[core] telemetry for when memory monitor is enabled #29152

Closed
wants to merge 9 commits

Conversation

clarng
Contributor

@clarng clarng commented Oct 7, 2022

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>

Why are these changes needed?

Add telemetry to record, at worker init time, whether the Ray cluster is using the memory monitor.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Manual test per https://docs.google.com/document/d/1XSWzx9ZatnlttsurCXzvV4vVsb2g1KxJ2T4zpf33bI8/edit

cat /tmp/ray/session_latest/usage_stats.json

{
  "usage_stats": {
    "ray_version": "3.0.0.dev0",
    "python_version": "3.7.7",
    "schema_version": "0.1",
    "source": "OSS",
    "session_id": "330fafc5-718b-48c5-92d0-dbe8dbc784a5",
    "git_commit": "{{RAY_COMMIT_SHA}}",
    "os": "linux",
    "collect_timestamp_ms": 1665116028212,
    "session_start_timestamp_ms": 1665115658078,
    "cloud_provider": null,
    "min_workers": null,
    "max_workers": null,
    "head_node_instance_type": null,
    "worker_node_instance_types": null,
    "total_num_cpus": 12,
    "total_num_gpus": null,
    "total_memory_gb": 10.672229005023837,
    "total_object_store_memory_gb": 5.336114501580596,
    "library_usages": ["dataset", "train", "tune"],
    "total_success": 302,
    "total_failed": 0,
    "seq_number": 302,
    "extra_usage_tags": {"gcs_storage": "memory", "memory_monitor_enabled": "false"},
    "total_num_nodes": 1,
    "total_num_running_jobs": 1
  },
  "success": true,
  "error": null
}
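The field being verified here is extra_usage_tags.memory_monitor_enabled. A minimal, self-contained sketch (not part of this PR; the helper name is my own) of checking that flag from a dumped stats payload:

```python
import json

# Example payload shaped like /tmp/ray/session_latest/usage_stats.json,
# trimmed to the fields relevant to this PR.
raw = (
    '{"usage_stats": {"extra_usage_tags": '
    '{"gcs_storage": "memory", "memory_monitor_enabled": "false"}}, '
    '"success": true, "error": null}'
)

def memory_monitor_enabled(stats_json: str) -> bool:
    """Return True if the memory monitor tag is recorded as enabled."""
    stats = json.loads(stats_json)
    tags = stats.get("usage_stats", {}).get("extra_usage_tags", {})
    # The tag is serialized as the string "true"/"false", not a JSON bool.
    return tags.get("memory_monitor_enabled") == "true"

print(memory_monitor_enabled(raw))  # False for the payload above
```

This mirrors the manual test: the monitor was off in that session, so the tag reads "false".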

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
@clarng clarng requested review from scv119 and rkooo567 October 7, 2022 04:15
@clarng clarng marked this pull request as ready for review October 7, 2022 04:15
@scv119
Contributor

scv119 commented Oct 7, 2022

Need to fix CI, though.

clarng added 3 commits October 6, 2022 21:54
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Contributor

@rkooo567 rkooo567 left a comment


Can we update this from usage_lib.py? I think we can do it this way:

  1. Add a method that fetches the full internal config via GetInternalConfigRequest.
  2. When the dashboard first starts, fetch the internal config and update only the relevant fields (in this case the memory monitor config), once.
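The two steps above could be simulated as follows. This is a hypothetical, self-contained sketch of the suggested flow; fetch_internal_config, RELEVANT_KEYS, and the in-memory tag dict are stand-ins, not Ray's actual dashboard or GCS client code:

```python
# Hypothetical key; Ray's memory monitor is driven by a refresh-interval
# config, and a nonzero interval means the monitor is enabled.
RELEVANT_KEYS = {"memory_monitor_refresh_ms"}

recorded_tags = {}  # stand-in for the recorded extra usage tags

def fetch_internal_config() -> dict:
    # Stand-in for issuing a GetInternalConfigRequest RPC to the GCS.
    return {"memory_monitor_refresh_ms": "250", "other_setting": "1"}

def record_relevant_tags_once() -> None:
    """Fetch the internal config and record relevant fields exactly once."""
    if recorded_tags:  # already recorded; later calls are no-ops
        return
    config = fetch_internal_config()
    for key in RELEVANT_KEYS & config.keys():
        recorded_tags["memory_monitor_enabled"] = str(config[key] != "0").lower()

record_relevant_tags_once()  # e.g. at dashboard startup
print(recorded_tags)  # {'memory_monitor_enabled': 'true'}
```

The point of the guard is that the tag is derived from cluster-level config, so it only needs to be recorded once per session rather than at every worker init.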

Two review threads on python/ray/_private/worker.py (outdated, resolved)
clarng added 4 commits October 7, 2022 10:18
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
@rkooo567
Contributor

Hmm, I think this is not going to work if you do ray start --head. Can you add a test for this case?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 13, 2022
@rkooo567
Contributor

I'd love to talk in person about how I'd approach this

@clarng
Contributor Author

clarng commented Oct 13, 2022

> I'd love to talk in person about how I'd approach this

sounds good, should be back on this very soon

…metrymm

Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
@clarng
Contributor Author

clarng commented Nov 23, 2022

Fixed in #30472

@clarng clarng closed this Nov 23, 2022
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

4 participants