DLPX-72326 Use OOMD instead of the kernel's OOM killer #297
Context:
Link: https://facebookmicrosites.github.io/oomd/
OOMD only works with 5.0+ kernels because it needs the PSI kernel interface (we should be fine on that front) and cgroups v2 hierarchies (I think we are currently on v1, so we may have to look into that). The proposal to use it is meant to overcome difficulties that we have with the current in-kernel OOM killer. Currently, when a process is killed by the OOM killer:
OOMD is very flexible and configurable. It can deal with all of the above by being configured to send SIGABRTs (and generate crash dumps) when processes take too much memory, take a system-wide memory snapshot before killing a process, and enforce per-process memory limits.
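For context on the PSI interface mentioned above, here is a minimal sketch of reading memory pressure the way a userspace tool might. It assumes the standard `/proc/pressure/memory` line format (`some`/`full` lines with `avg10`, `avg60`, `avg300`, and `total` fields). The function name `parse_psi` and the sample values are hypothetical and for illustration only; this is not how OOMD itself is implemented.

```python
def parse_psi(text):
    """Parse PSI text (e.g. the contents of /proc/pressure/memory).

    Returns a dict such as {"some": {"avg10": 0.0, ...}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


# Illustrative sample only; real values come from /proc/pressure/memory.
SAMPLE = (
    "some avg10=0.00 avg60=0.12 avg300=0.05 total=12345\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=678\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/pressure/memory") as f:
            print(parse_psi(f.read()))
    except OSError:
        # PSI is not available on this kernel or platform.
        print(parse_psi(SAMPLE))
```

On a kernel without PSI support the file does not exist, so the sketch falls back to the sample text.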
This Patch:
Enables cgroups v2 for the appstack by setting a kernel parameter at boot through GRUB.
Testing:
I created a VM and passed it that kernel parameter. Then I manually checked that we had switched to cgroups v2 by looking at the cgroup directory under sysfs.
= Before the patch:
= After the patch (memory directory is gone as it is unified under the top-level cgroup directory in v2):
The only service that had a problem starting was our existing OOM service, which we wrote with the assumption of using cgroups v1:
So I disabled it and ran dx-test on that VM:
dx-test:
http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/dx-integration-tests/23953/
The BB section of the above failed because of how I specified my VM (sd-bpf instead of sd-bpf.dcol2), so I re-ran the BB part: http://selfservice.jenkins.delphix.com/job/devops-gate/job/master/job/blackbox-self-service/89831/
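The manual sysfs check described in Testing could be scripted roughly as follows. This is a heuristic sketch, not part of the patch: it assumes the cgroup filesystem is mounted at `/sys/fs/cgroup`, treats the presence of a `cgroup.controllers` file at the root as the sign of a unified v2 hierarchy, and treats a `memory` subdirectory as the sign of v1. The function name `cgroup_version` is hypothetical.

```python
import os


def cgroup_version(root="/sys/fs/cgroup"):
    """Guess which cgroup hierarchy is mounted at `root`.

    Returns 2 for a unified (v2) hierarchy, 1 for a v1 layout,
    or None if neither marker is found.
    """
    # The v2 root exposes cgroup.controllers directly at the top level.
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return 2
    # v1 mounts each controller (such as memory) as its own subdirectory.
    if os.path.isdir(os.path.join(root, "memory")):
        return 1
    return None


if __name__ == "__main__":
    print("cgroup version:", cgroup_version())
```

A hybrid setup (v1 controllers plus a separate unified mount) would be reported as v1 by this sketch, which matches the "Before the patch" state only loosely.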
Future Work
I still need to:
docker used by the vSDK project (cc: @nhlien93)