-
Notifications
You must be signed in to change notification settings - Fork 7k
[core] (cgroups 6/n) CgroupManager cleans up the entire cgroup hierarchy in reverse order when destroyed. #56260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
05c4dbc
645f9a0
2bb2c5b
4793094
85d0ebf
3a5a020
68b0c93
f52354b
148d04d
d3f8b79
d76ff15
2798ea5
3fda505
9e1e931
f066f34
e6b4926
f4e0cb2
544ba83
669ba99
2e341d6
9e46ce6
7c745c5
d7eb863
ff64534
da4b475
37e205f
7b83932
b911d25
63506dc
e6f1ae9
ead9de1
d0bcf4d
08c36d8
758955a
e59ac62
8866592
5364a1d
c77e1f8
fe54541
67b21d5
6e6bc32
c399d45
2cb4f6e
dd25a97
4a95598
cc51788
d31eb1a
d43a5d3
a458406
01023b9
bb5d866
f4a8553
17d1008
3423eab
5357ea3
1ecfdda
17c07da
e044fcd
b59dbc4
13eee38
f698183
3b37051
946ec90
ca63baa
0fe9113
398ef88
ca83426
f7f04db
dfd9b07
1884da5
e0bbac8
2457558
36101f4
a145a81
fc85704
b5f6c5e
03c731e
4aeabf4
44a5844
4de334c
89f49e5
dd0bf98
eee8982
fd9ef0d
1c17e3f
4a581d7
df8f925
0059039
e386cb5
cc3af83
161dd95
92bb7ce
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -113,48 +113,65 @@ StatusOr<std::unique_ptr<CgroupManager>> CgroupManager::Create( | |
| return cgroup_manager; | ||
| } | ||
|
|
||
| // TODO(#54703): This is a placeholder for cleanup. This will call | ||
| // CgroupDriver::DeleteCgroup. | ||
| void CgroupManager::RegisterDeleteCgroup(const std::string &cgroup_path) { | ||
| cleanup_operations_.emplace_back([cgroup = cgroup_path]() { | ||
| RAY_LOG(INFO) << absl::StrFormat("Deleting all cgroup %s.", cgroup); | ||
| cleanup_operations_.emplace_back([this, cgroup = cgroup_path]() { | ||
| Status s = this->cgroup_driver_->DeleteCgroup(cgroup); | ||
| if (!s.ok()) { | ||
| RAY_LOG(WARNING) << absl::StrFormat( | ||
| "Failed to delete cgroup %s with error %s.", cgroup, s.ToString()); | ||
| } | ||
| }); | ||
| } | ||
|
|
||
| // TODO(#54703): This is a placeholder for cleanup. This will call | ||
| // CgroupDriver::MoveAllProcesses. | ||
| void CgroupManager::RegisterMoveAllProcesses(const std::string &from, | ||
| const std::string &to) { | ||
| cleanup_operations_.emplace_back([from_cgroup = from, to_cgroup = to]() { | ||
| RAY_LOG(INFO) << absl::StrFormat( | ||
| "Moved All Processes from %s to %s.", from_cgroup, to_cgroup); | ||
| cleanup_operations_.emplace_back([this, from_cgroup = from, to_cgroup = to]() { | ||
| Status s = this->cgroup_driver_->MoveAllProcesses(from_cgroup, to_cgroup); | ||
| if (!s.ok()) { | ||
| RAY_LOG(WARNING) << absl::StrFormat( | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: WARNING vs. ERROR? Also, what is the potential implication of this failing? If we're unable to move processes into their appropriate cgroup, then.... what, they'll just run without the resource limits we'd otherwise like imposted? Should we keep running in that scenario or bail out?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess this line of questioning applies to delete too, but.... process move seems more concerning.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is during clean up. When CgroupManager starts up with a The reason for doing this is that the default cgroup path, and the most commonly anticipated pattern, will be to run this with However, the root cgroup for the container is not the actual root cgroup of the operating system. This means that it is subject to the No Internal Processes constraint. It cannot have child cgroups and a controller enabled if it has processes inside it. The implication for clean up is that the cgroup hierarchy will not be deleted properly. This is mostly fine when it's running inside a container. Given this, maybe ERROR makes more sense? |
||
| "Failed to move all processes from %s to %s with error %s", | ||
| from_cgroup, | ||
| to_cgroup, | ||
| s.ToString()); | ||
| } | ||
| }); | ||
| } | ||
|
|
||
| // TODO(#54703): This is a placeholder for cleanup. This will call | ||
| // CgroupDriver::AddConstraint(cgroup, constraint, default_value). | ||
| template <typename T> | ||
| void CgroupManager::RegisterRemoveConstraint(const std::string &cgroup, | ||
| const Constraint<T> &constraint) { | ||
| cleanup_operations_.emplace_back( | ||
| [constrained_cgroup = cgroup, constraint_to_remove = constraint]() { | ||
| RAY_LOG(INFO) << absl::StrFormat( | ||
| "Setting constraint %s to default value %lld for cgroup %s", | ||
| constraint_to_remove.name_, | ||
| constraint_to_remove.default_value_, | ||
| constrained_cgroup); | ||
| [this, constrained_cgroup = cgroup, constraint_to_remove = constraint]() { | ||
| std::string default_value = std::to_string(constraint_to_remove.default_value_); | ||
| Status s = this->cgroup_driver_->AddConstraint(constrained_cgroup, | ||
| constraint_to_remove.controller_, | ||
| constraint_to_remove.name_, | ||
| default_value); | ||
| if (!s.ok()) { | ||
| RAY_LOG(WARNING) << absl::StrFormat( | ||
| "Failed to set constraint %s=%s to default value for cgroup %s with error " | ||
| "%s.", | ||
| constraint_to_remove.name_, | ||
| default_value, | ||
| constrained_cgroup, | ||
| s.ToString()); | ||
| } | ||
| }); | ||
| } | ||
|
|
||
| // TODO(#54703): This is a placeholder for cleanup. This will call | ||
| // CgroupDriver::DisableController. | ||
| void CgroupManager::RegisterDisableController(const std::string &cgroup, | ||
| void CgroupManager::RegisterDisableController(const std::string &cgroup_path, | ||
| const std::string &controller) { | ||
| cleanup_operations_.emplace_back([cgroup_to_clean = cgroup, | ||
| controller_to_disable = controller]() { | ||
| RAY_LOG(INFO) << absl::StrFormat( | ||
| "Disabling controller %s for cgroup %s.", controller_to_disable, cgroup_to_clean); | ||
| }); | ||
| cleanup_operations_.emplace_back( | ||
| [this, cgroup = cgroup_path, controller_to_disable = controller]() { | ||
| Status s = this->cgroup_driver_->DisableController(cgroup, controller_to_disable); | ||
| if (!s.ok()) { | ||
| RAY_LOG(WARNING) << absl::StrFormat( | ||
| "Failed to disable controller %s for cgroup %s with error %s", | ||
| controller_to_disable, | ||
| cgroup, | ||
| s.ToString()); | ||
| } | ||
| }); | ||
| } | ||
|
|
||
| Status CgroupManager::Initialize(int64_t system_reserved_cpu_weight, | ||
|
|
@@ -168,11 +185,11 @@ Status CgroupManager::Initialize(int64_t system_reserved_cpu_weight, | |
| cpu_weight_constraint_.Max() - system_reserved_cpu_weight; | ||
|
|
||
| RAY_LOG(INFO) << absl::StrFormat( | ||
| "Initializing CgroupManager at base cgroup path at %s. Ray's cgroup " | ||
| "hierarchy will under the node cgroup %s. The %s controllers will be " | ||
| "Initializing CgroupManager at base cgroup at '%s'. Ray's cgroup " | ||
| "hierarchy will under the node cgroup at '%s'. The %s controllers will be " | ||
| "enabled. " | ||
| "System cgroup %s will have constraints [%s=%lld, %s=%lld]. " | ||
| "Application cgroup %s will have constraints [%s=%lld].", | ||
| "The system cgroup at '%s' will have constraints [%s=%lld, %s=%lld]. " | ||
| "The application cgroup '%s' will have constraints [%s=%lld].", | ||
| base_cgroup_path_, | ||
| node_cgroup_path_, | ||
| supported_controllers, | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why remove the log? (I won't insist on bringing the log back, just curious to the motivation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it in by mistake from the previous PR. I was using it to debug a test.