-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Query cgroup memory data and fuse with /proc data for more reliable information. #50
fix: Query cgroup memory data and fuse with /proc data for more reliable information. #50
Conversation
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/175174574 The labels on this github issue will be updated when the story is started. |
Thanks much for submitting this PR. Apologies for taking so long to get to it -- there have been a few reorgs, so things have been a bit hectic here. There are a few things that I'd like to talk about. Firstly, this code change needs tests. We can't accept this PR without good tests. Secondly, I notice that a) Consider the scenario where there's not enough free RAM on the system to handle full utilization of the remaining "free" RAM as calculated from data in the cgroup. Is that something that you should be concerned about? If not, then how is the used RAM statistic actually expected to be used? I used sections 5.5 and 5.2 of this document as a reference for understanding the contents of |
Agreed. At the time I was unsure how to actually test these changes, and punted. I will now do the same for
I currently assume that you are talking about Line 113 in 62465d5
Is that correct ? If yes, then the reasoning was that if cgroup claims that the application has more RAM for its use than the system has physical RAM, then cgroup's data is obviously bogus and should be ignored, and the host RAM is the limit. This came from the fact the linux boxes I had access to reported an extremely large limit outside of a container, and for containers with no limit set also. Example:
I.e. the reported limit is 8 EiB (Exa!). The physical RAM of that box is 64 GiB. And with this coming back to me, this was also part of the reason to not simply use cgroup data.
I have trouble understanding the question.
Looking at two different systems (ubuntu 20.04, kernel Checking further I see that
However the So the general gist of the statement looks to be true, if you meant the So, regarding
I can agree to that (rss ->
I was indeed not willing to assume that Part of the complexity is that I was paranoid enough to not assume that presence of If we make that assumption, or similar that we will have both of I will now go forward with:
|
Updated the PR as I proposed in my comments. |
Hi @klakin-pivotal, is there anything else you are expecting beyond what @andreas-kupries has added in 6edab8d ? I was getting excited to see some momentum on this issue and hope it won't peter out before it is resolved. 😄 |
Yeah,
I tested on a system that did not have any child cgroups, so the top-level values matched the total values. So, your code looks good, but I think that we've two fundamental problems:
On Ubuntu 4.15.0-136, Gentoo 5.4.38 and 4.16.9 I see a However, I spun up a Ubuntu 20.04 machine, and there is no It seems that -for some reason- Ubuntu 20.04 (and some other systems) are using the cgroup v2 memory file layout, rather than the cgroup v1 layout. The docs say when using the cgroup v2 layout, swap usage will appear in a Unfortunately, your code will need to handle all of these cases.
I think that the correct way to do this is:
I believe that reading
NOTE |
Ok. Thanks for the additional review and comments. I will now work on fixing the issues you mentioned. Starting with using |
That matches what I see with my 20.04 boxes
I have no Hm ? Can you provide a link to the docs you read ? I.e. location of I found Of course, none of these can be found on my 20.04 box either (That would be too easy). Hm ... And https://www.kernel.org/doc/Documentation/cgroup-v2.txt also only talks about Have you managed to find examples for the content of |
Five minutes after posting a
Now checking if my machines have swap accounting active. |
With swap accounting active we should see/have
Further looking at
Ditto/similar for the ubuntu box. Strong indications that I do not have swap accounting active. Will now activate that and see if it makes a difference. |
With swap accounting activated the My conclusion is that kernel 5.3.18 and 5.4.0 are still using cgroup v1, not v2. At this point I believe that support for cgroup v2 can be defered until we actually have a machine using it. |
If you can agree to the deferral of cgroup v2 support the PR should be ready again. |
... More information ... Thanks to feedback from Doing that with Ubuntu (to keep the work box up) the file I suspect that It should be possible to add the v2 support with that. |
@klakin-pivotal @jandubois Added support for cgroup v2 now. Reorganized the internals a fair bit. The data extraction functions for the main pieces all have the prefix The tests got extended to cover these new extraction functions, and the v2 support. Please review. |
I haven't run the tests, but I like your code so far. However the code has one major issue, and I've a few minor quibbles. The major issue: (And yeah, I spent like half a day trying to figure out if there was a better way to find the v1 The minor quibbles: What's the deal with this commented-out code? Should it be removed, un-commented and worked on? https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R410-R412 Is it safe to not check the error return of (I also notice that you are checking the error return of I'm fairly certain that the "global controller" should be called the "cgroup v2 memory controller": https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R470 Thanks for sticking with this. |
Commit c7bedf3 now searches by default, and uses the conventional locations as last-stop fallback.
I decided to go with
Ditto
Done, commit 7ef1556
Done, removed. Part of commit 86c8dc5
Commit 86c8dc5
Well, memory usage is a required thing. Swap usage missing on the other hand we can work around.
Commit 49c490a |
Okay, I looked over (and ran) the tests. I think we have three things left before I can attempt to merge this in and make a release:
The v2 cgroup controller will -apparently- always be controller id
While searching for entires with a blank "controller name" part does currently find the v2 controller, it is best to stick with the documented contract.
Thanks much for the PR! |
I apologize, I spoke a little bit too too soon. These lines have a bug: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R62-R64 Given the following
the code will -as written- fail to find any cgroup controllers and will fall back to the system's top-level numbers. That first column in
My horrifying code to correctly handle this is to replace line 62 with this:
Hopefully there's a way to do that is easier to read. Anyway, once this gets ironed out, and the three things from my previous comment are resolved, then we are well and truly good to go. Additional background information
|
This code is just a precheck I added to avoid the full check with splitting line into fields for all entries. I will drop this code completely. |
Done, locally. No commit, will be squashed with the coming rebase anyway. |
List of people for an ack section: IMHO @jandubois |
…ble information. The standard API for memory limits and usage under /proc are not containerized, and return host data when used in a container. This is a source of trouble for everything trying to configure itself based on available memory, causing it to work with bogus data. Inside containers the cgroup information rules instead. Trying to detect when running inside a container or not looks to have many approaches, which are either unreliable, or non-portable, or both. The approach here is to query both sets of data, and then fuse them for accuracy. As we ask for the available memory taking the min of cgroup and proc data is the safe choice for this fusion. It assumes that inside a container the cgroup data is likely less than the proc data, due to the container getting limited. And outside the container the cgroup data should not be less than proc. (That said, in the local box, having 36GiB memory cgroup reports a limit of nearly 8EiB (yes, Exa; Exact value is 8EiB-4KiB; As far as is known the absolute limit for Linux virtual memory management). As proc reports the 36GiB correctly the min is again the proper choice for such systems where cgroup on the host is that way out of whack). As for containers without limits set by docker or other runtime, cgroup data seems to be that of the host. The min-fusion again gives the best possible result, i.e. the host memory. Note that getting the cgroup information is convoluted. - We have v1 and v2 controllers. Both can be mounted, some mixture, or neither. - The location of the mounts is arbitrary. - The process the library is part of is likely not managed by the root cgroup. The actual controlling group is needed. - Swap usage is not present if swap accounting is not active. The gory details are noted in code comments. Added tests. Added clean up of the test's temp directory. Added for most of the mock file setup for more readable tests.
7f8bc4b
to
1800af1
Compare
Actions:
|
idk, I didn't do any actual work on the PR beyond nudging people so it doesn't get forgotten. I rather think @mudler should be mentioned because he did all the initial debugging to determine that this was the cause of the log-cache failures when containerized (see #48). |
👍 I had not remembered that @mudler was involved at the beginning. |
Looks good except for one last thing. Your mount-option-processing code here: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R659 does a substring match on the mount options for
|
Avoid trigger on larger strings having the target as substring
Done. Apologies for the delay. Saw the comment and then promptly forgot due to other work. Remembered today. |
The PR has been merged to the Thanks much for all the work! Do let me know if there's anything more that I need to do to make this release available for others to use. |
Thanks a lot for your diligent review! What we are really looking for now is a |
In PR #50, the behavior of GetMem() was changed to return the amount of memory used / available within the current cgroup. There are some use cases that you don't want cgroup-constrained memory information. A new GetMemIgnoringCGroups method has been introduced to provide the system wide memory information. [#179517901] Update Bosh Agent to correctly report memory used in `bosh vms --vitals` and related code paths, and publish new stemcells Signed-off-by: Brian Upton <bupton@vmware.com>
Motivation
This change is made to fix #48
See also cloudfoundry-incubator/kubecf/issues/1312
Description
The standard API for memory limits and usage under /proc are not containerized, and return host data when used in a container. This a source of trouble for everything trying to configure itself based on available memory, working from bogus data.
Inside containers the cgroup information rules instead.
Trying to detect when running inside a container or not looks to have many approaches, which are either unreliable, or non-portable, or both.
The approach here is to query both sets of data, and then fuse them for accuracy. As we ask for the available memory taking the min of cgroup and proc data is the safe choice for this fusion.
It assumes that inside a container the cgroup data is likely less than the proc data, due to the container getting limited. And outside the container the cgroup data should not be less than proc.
(That said, in the local box, having 36G memory cgroup reports a limit of 8E (yes, exa). As proc reports the 36G correctly the min is again the proper choice for such systems where cgroup on the host is that way out of whack).
As for containers without limits set by docker or other runtime, cgroup data seems to be that of the host. The min-fusion again gives the best possible result, i.e. the host memory.
Testing
First through
go run free.go
in theexamples/
subdirectory, then viago build free.go
Conclusion
With this change to
gosigar
users likelog-cache
will be able to properly see what memory is actually available to them, allowing for proper configuration without requiring user intervention. Without this changelog-cache
s auto-configuration will mis-configure itself thinking it has more memory available than actually, and user intervention, i.e. manual configuration of the memery it has, is required.