fix: Query cgroup memory data and fuse with /proc data for more reliable information. #50

Conversation

andreas-kupries
Contributor

Motivation

This change is made to fix #48
See also cloudfoundry-incubator/kubecf/issues/1312

Description

The standard API for memory limits and usage under /proc is not containerized and returns host data when used in a container. This is a source of trouble for anything trying to configure itself based on available memory, since it ends up working from bogus data.

Inside containers the cgroup information is authoritative instead.

There are many approaches for detecting whether the code is running inside a container, and they are either unreliable, non-portable, or both.

The approach here is to query both sets of data and then fuse them for accuracy. Since we are asking for the available memory, taking the min of the cgroup and proc data is the safe choice for this fusion.

This assumes that inside a container the cgroup data is likely less than the proc data, because the container is limited, while outside a container the cgroup data should not be less than the proc data.

(That said, on the local box, which has 36G of memory, cgroup reports a limit of 8E (yes, exa). As proc reports the 36G correctly, the min is again the proper choice for systems where cgroup on the host is that far out of whack.)

As for containers without limits set by docker or other runtime, cgroup data seems to be that of the host. The min-fusion again gives the best possible result, i.e. the host memory.
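
For illustration, a minimal sketch of this min-fusion in Go; the function and helper names (fuseAvailableMemory, readProcAvailable, readCgroupLimit) are invented for the example and are not the PR's actual functions:

package sigar

// Sketch only: fuse the /proc and cgroup views of available memory by taking
// the min. readProcAvailable and readCgroupLimit stand in for the real
// extraction code; the boolean reports whether the data source was found.
func fuseAvailableMemory(readProcAvailable, readCgroupLimit func() (uint64, bool)) uint64 {
	procAvail, haveProc := readProcAvailable()
	cgroupLimit, haveCgroup := readCgroupLimit()

	switch {
	case haveProc && haveCgroup:
		// Inside a limited container the cgroup value is the smaller one; on
		// hosts reporting a bogus 8E cgroup limit the proc value wins. Either
		// way the min is the safe choice.
		if cgroupLimit < procAvail {
			return cgroupLimit
		}
		return procAvail
	case haveCgroup:
		return cgroupLimit
	default:
		return procAvail // proc only, or zero if neither source was found
	}
}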

Testing

  • The modified code was run through the go compiler, indirectly.
    First through go run free.go in the examples/ subdirectory, then via go build free.go
  • It was executed on the host, and confirmed to report the proper 36G this box has.
  • It was executed in a kubecf doppler container deployed on a minikube with 25G memory.
    • For an unlimited container it properly reports the host memory, i.e. minikube's 25G.
    • For a limited container it properly reported the container's limit (256M for doppler).

Conclusion

With this change to gosigar, users like log-cache will be able to properly see what memory is actually available to them, allowing for proper configuration without requiring user intervention. Without this change, log-cache's auto-configuration will misconfigure itself, thinking it has more memory available than it actually does, and user intervention, i.e. manual configuration of the memory it has, is required.

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/175174574

The labels on this github issue will be updated when the story is started.

@klakin-pivotal
Contributor

klakin-pivotal commented Feb 25, 2021

Thanks much for submitting this PR. Apologies for taking so long to get to it -- there have been a few reorgs, so things have been a bit hectic here.

There are a few things that I'd like to talk about.

Firstly, this code change needs tests. We can't accept this PR without good tests.

Secondly, I notice that self.Used and self.ActualUsed are set to the values reported by the cgroup, if and only if memory.limit_in_bytes is less than the system's RAM size. Are you sure that this is the right thing to do?

a) Consider the scenario where there's not enough free RAM on the system to handle full utilization of the remaining "free" RAM as calculated from data in the cgroup. Is that something that you should be concerned about? If not, then how is the used RAM statistic actually expected to be used?
b) Based on my testing, it seems that memory.usage_in_bytes includes buffers and cache. (I notice that the rss, page_cache, and swap components of memory.stat add up to memory.usage_in_bytes.) Given that the code that doesn't use cgroups takes pains to subtract out buffers and cache, it seems to me that you'd be reporting far too much space used with the new code as is. If you omit the page_cache number from memory.stat, then the number that comes out agrees with what htop says is the amount of actually used RAM.
c) If you agree that we can add the rss and swap components of memory.stat to arrive at the amount of actually used RAM, why not just always use that technique when reporting the amount of used RAM? For software not running in a cgroup, it does seem to report the amount of RAM used on the system. Is this code expected to run on systems that don't have cgroups configured?

I used sections 5.5 and 5.2 of this document as a reference for understanding the contents of memory.usage_in_bytes and memory.stat: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

@andreas-kupries
Contributor Author

Firstly, this code change needs tests. We can't accept this PR without good tests.

Agreed. At the time I was unsure how to actually test these changes, and punted.
Now that I am looking again at the code I find that you are essentially mocking the /proc directory, via the Procd variable.
The tests redirect that to a temp directory and you fill it with the files of interest.

I will now do the same for /sys/fs/cgroup/memory. The variable will be SysMemd.
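
For illustration, a rough sketch of such a test helper, assuming Procd and the proposed SysMemd are plain string package variables; the helper name and mock contents are made up:

package sigar

import (
	"os"
	"path/filepath"
	"testing"
)

// Sketch of a test helper that redirects the /proc and cgroup lookups into a
// temp directory filled with mock files. Procd exists in gosigar today;
// SysMemd is the variable proposed above.
func setupMockMemoryFS(t *testing.T) {
	tmp := t.TempDir() // removed automatically when the test finishes

	writeMock := func(relPath, content string) {
		full := filepath.Join(tmp, relPath)
		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
			t.Fatal(err)
		}
		if err := os.WriteFile(full, []byte(content), 0o644); err != nil {
			t.Fatal(err)
		}
	}

	writeMock("proc/meminfo", "MemTotal: 374256 kB\nMemFree: 274460 kB\n")
	writeMock("sys/fs/cgroup/memory/memory.limit_in_bytes", "262144000\n")

	// Point the package's lookup roots at the mock directories.
	Procd = filepath.Join(tmp, "proc")
	SysMemd = filepath.Join(tmp, "sys/fs/cgroup/memory")
}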

Secondly, I notice that self.Used and self.ActualUsed are set to the values reported by the cgroup, if and only if memory.limit_in_bytes is less than the system's RAM size. Are you sure that this is the right thing to do?

I currently assume that you are talking about

if mlib_value > self.Total {

Is that correct ?

If yes, then the reasoning was that if cgroup claims that the application has more RAM for its use than the system has physical RAM, then cgroup's data is obviously bogus and should be ignored, and the host RAM is the limit.

This came from the fact that the Linux boxes I had access to reported an extremely large limit outside of a container, and also for containers with no limit set. Example:

% cat /sys/fs/cgroup/memory/memory.limit_in_bytes 
9223372036854771712
% echo '9223372036854771712/1024/1024/1024/1024/1024/1024' | bc -l
7.99999999999999644728

I.e. the reported limit is 8 EiB (Exa!). The physical RAM of that box is 64 GiB.
This is from an ubuntu box with a 5.4 kernel.
My opensuse box with the 5.3 kernel reports the same.
I cannot treat that as the proper limit.
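
For illustration, the kind of sanity check being described, as a sketch; the function name is invented, and procTotal stands in for the /proc-reported host RAM (self.Total):

package sigar

// Sketch: ignore a cgroup limit that exceeds the host's physical RAM (e.g. the
// bogus ~8EiB "unlimited" value) and fall back to the /proc total instead.
func effectiveLimit(procTotal, cgroupLimit uint64, haveCgroupLimit bool) uint64 {
	if !haveCgroupLimit || cgroupLimit > procTotal {
		return procTotal
	}
	return cgroupLimit
}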

And with this coming back to me: this was also part of the reason not to simply use the cgroup data.
Not just for systems where cgroups are not configured,
but also to have a sensible fallback for the limit in the form of the host RAM size.

a) ...

I have trouble understanding the question.

b) Based on my testing, it seems that memory.usage_in_bytes includes buffers and cache. (I notice that the rss, page_cache, and swap components of memory.stat add up to memory.usage_in_bytes.)

Looking at two different systems (Ubuntu 20.04, kernel 5.4.0-66-generic, and openSUSE Leap 15.2, kernel 5.3.18-lp152.60-default) I notice that neither has page_cache or swap components in memory.stat. I can only find rss, and a cache component, which may or may not be the page_cache. ... Ok, based on the memory.txt you referenced, the cache is indeed the page cache information. I still do not see swap. Can you tell me what kernel the machine had where you saw the swap line?

Checking further I see that rss and cache do not add up to usage_in_bytes as claimed, not by a long shot.

hephaistos:(508) ~ > cat /sys/fs/cgroup/memory/memory.stat|grep 'swap\|cache\|rss' ; echo ; cat /sys/fs/cgroup/memory/memory.usage_in_bytes 
cache 365584384
rss 6356992
rss_huge 0
total_cache 53424050176
total_rss 8262356992
total_rss_huge 0

61686333440
(rss:6356992)+(cache:365584384) = 371941376 ~ 1/164 * (usage_in_bytes:61686333440)

However the total_rss and total_cache values nearly do (they overshoot by 73728 in the data above).

So the general gist of the statement looks to be true, if you meant the total_... values and mistyped.
Comparing with htop output, both memory.stat/total_rss and (memory.usage_in_bytes - memory.stat/total_cache) roughly agree with the used memory htop reports. (Especially if I assume that htop prints decimal M/G and not binary M/G.)

So, regarding

we can add the rss and swap components of memory.stat to arrive at the amount of actually used RAM

I can agree to that (rss -> total_rss). Even a missing swap component is ok. The value will just be a bit more inaccurate without it.
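
For illustration, a minimal sketch of deriving used memory from the memory.stat totals along these lines; the function name is invented and this is not the PR's actual code:

package sigar

import (
	"bufio"
	"os"
	"strconv"
	"strings"
)

// Sketch: derive "actually used" memory from cgroup v1 memory.stat using the
// hierarchical total_* counters discussed above. A missing total_swap (swap
// accounting off) just leaves the result slightly less accurate.
func cgroupUsedMemory(statPath string) (uint64, error) {
	f, err := os.Open(statPath) // e.g. /sys/fs/cgroup/memory/memory.stat
	if err != nil {
		return 0, err
	}
	defer f.Close()

	counters := map[string]uint64{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			counters[fields[0]] = v
		}
	}
	// total_rss plus total_swap, deliberately excluding total_cache
	// (buffers/cache), as in the discussion above.
	return counters["total_rss"] + counters["total_swap"], scanner.Err()
}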

Is this code expected to run on systems that don't have cgroups configured?

I was indeed not willing to assume that cgroups (data) would be available. Thus looking for both sets of information, and fusing them if both are found.

Part of the complexity is that I was paranoid enough to not assume that presence of memory.limit_in_bytes implies presence of memory.usage_in_bytes also.

If we make that assumption, or similarly that we will have both memory.stat and memory.limit_in_bytes present, or neither, but not one without the other, then the middle of the new code can be completely removed.

I will now go forward with:

  • Switch to use memory.stat for usage data.
  • Remove the assumption that it is possible to have only one of memory.stat and memory.limit_in_bytes.
  • Create unit tests using directory redirection to provide mock values.

@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 4, 2021

CLA Signed

The committers are authorized under a signed CLA.

@andreas-kupries
Contributor Author

Updated the PR as I proposed in my comments.

@jandubois

Hi @klakin-pivotal, is there anything else you are expecting beyond what @andreas-kupries has added in 6edab8d ?

I was getting excited to see some momentum on this issue and hope it won't peter out before it is resolved. 😄

@klakin-pivotal
Contributor

klakin-pivotal commented Apr 5, 2021

Yeah, page_cache was totally a thinko. I should have written cache. Also, you're correct that the code should be using the total_... values. From memory.txt:

total_<counter>		- # hierarchical version of <counter>, which in
			addition to the cgroup's own value includes the
			sum of all hierarchical children's values of
			<counter>, i.e. total_cache

I tested on a system that did not have any child cgroups, so the top-level values matched the total values.

So, your code looks good, but I think that we've two fundamental problems:

  1. Your code needs to handle both cgroup v1 and cgroup v2 formatted cgroups.

On Ubuntu 4.15.0-136, Gentoo 5.4.38 and 4.16.9 I see a swap and total_swap line in /sys/fs/cgroup/memory.

However, I spun up a Ubuntu 20.04 machine, and there is no swap entry in memory.stat.
Even enabling swap didn't make the entry in memory.stat appear.

It seems that -for some reason- Ubuntu 20.04 (and some other systems) are using the cgroup v2 memory file layout, rather than the cgroup v1 layout. The docs say when using the cgroup v2 layout, swap usage will appear in a memory.swap file, if the cgroup being inspected is not the root cgroup. (However, I have yet to see this file, even in non-root cgroups!)

Unfortunately, your code will need to handle all of these cases.

  2. Are you examining the correct cgroup? Cgroups are hierarchical, and based on what I've read, I expect that the root cgroup will often be the wrong cgroup.

I think that the correct way to do this is (a rough sketch in Go follows the list):

  1. Find the memory key in /proc/self/cgroup.
  2. Extract the last item from the :-separated list
  3. cd to `/sys/fs/cgroup/memory/$THE_LAST_ITEM`
  4. Look for the files that your code is currently looking for.
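
A rough sketch of those steps; the function name is invented, and the v1 line format (hierarchy-ID:controller-list:path) is taken from the kernel docs:

package sigar

import (
	"bufio"
	"os"
	"strings"
)

// Sketch: find the memory cgroup of the current process from
// /proc/self/cgroup. A v1 line looks like "4:memory:/user.slice"; the last
// field is the path relative to the memory controller's mount point.
func selfMemoryCgroupPath() (string, bool) {
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		return "", false
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Format: hierarchy-ID:controller-list:path
		parts := strings.SplitN(scanner.Text(), ":", 3)
		if len(parts) != 3 {
			continue
		}
		for _, ctrl := range strings.Split(parts[1], ",") {
			if ctrl == "memory" {
				// Join this with /sys/fs/cgroup/memory (or wherever the
				// controller is mounted) and look for the usual files there.
				return parts[2], true
			}
		}
	}
	return "", false
}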

I believe that reading /proc/self/cgroup to figure out what memory cgroup one belongs to will always be the correct thing to do because either:

  • gosigar is being included as a library in a process that uses it.
  • A process is shelling out to gosigar, making it a child of the process that needs to know about memory usage and limits.

NOTE
I don't use gosigar, but I assume that that's how people use it. Please correct me if I misunderstand how the tool is used.

@andreas-kupries
Contributor Author

Ok. Thanks for the additional review and comments.
AFAIK/AFAICT your notes about how gosigar is used are correct.

I will now work on fixing the issues you mentioned. Starting with using /proc/self/cgroup.

@andreas-kupries
Contributor Author

However, I spun up a Ubuntu 20.04 machine, and there is no swap entry in memory.stat.
Even enabling swap didn't make the entry in memory.stat appear.

That matches what I see with my 20.04 boxes

The docs say when using the cgroup v2 layout, swap usage will appear in a memory.swap file, if the cgroup being inspected is not the root cgroup. (However, I have yet to see this file, even in non-root cgroups!)

I have no memory.swap file in my 20.04 and leap boxes either, neither under root, nor for the user.slice cgroup reported for memory in /proc/self/cgroup.

Hm? Can you provide a link to the docs you read, i.e. the location of the docs your sentence refers to?

I found https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html and it does not mention memory.swap at all. In its place it seems to have a number of separate memory.swap.<> files, with <> in { current, max, high, events }.

Of course, none of these can be found on my 20.04 box either (That would be too easy).

Hm ... latest in that url, seems to be for kernel 5.12.x ... My 20.04 system uses a 5.4.x kernel ...
Docs for that looks to be https://www.kernel.org/doc/html/v5.4/admin-guide/cgroup-v2.html
And even that already talks about separate memory.swap.<> files :(
Ditto 5.3, which is the kernel used by the leap box.

And https://www.kernel.org/doc/Documentation/cgroup-v2.txt also only talks about memory.swap.current.

Have you managed to find examples for the content of memory.swap, and/or memory.swap.current ?
(For the latter I assume/suspect that it is a single value, swap in bytes, no keying. Not sure however)

@andreas-kupries
Contributor Author

andreas-kupries commented Apr 9, 2021

Five minutes after posting a research mail I find https://doc.opensuse.org/documentation/leap/archive/42.1/tuning/html/book.sle.tuning/cha.tuning.cgroups.html and inside I read

Control swap usage by setting swapaccount=1 as a kernel boot parameter.

Now checking if my machines have swap accounting active.

@andreas-kupries
Contributor Author

andreas-kupries commented Apr 9, 2021

Per https://unix.stackexchange.com/questions/147158/how-to-enable-swap-accounting-for-memory-cgroup-in-archlinux

With swap accounting active we should see/have /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes.

% ll /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes
/bin/ls: cannot access '/sys/fs/cgroup/memory/memory.memsw.limit_in_bytes': No such file or directory
% ll /sys/fs/cgroup/memory/memsw
/bin/ls: cannot access '/sys/fs/cgroup/memory/memsw': No such file or directory

Further looking at /etc/default/grub ...

% grep CMDLINE_L /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="splash=silent resume=/dev/disk/by-uuid/3549eaa1-078d-46fb-982f-0b770493ca47 mitigations=auto quiet"
GRUB_CMDLINE_LINUX=""

Ditto/similar for the ubuntu box.

Strong indications that I do not have swap accounting active.

Will now activate that and see if it makes a difference.

@andreas-kupries
Contributor Author

andreas-kupries commented Apr 9, 2021

@klakin-pivotal

With swap accounting activated the memory.stat file gains the keys swap and total_swap.
The candidate cgroup v2 files (memory.swap, memory.swap.current) do not appear.
This is openSUSE Leap 15.2. I confirm the same for Ubuntu 20.04.

My conclusion is that kernel 5.3.18 and 5.4.0 are still using cgroup v1, not v2.

At this point I believe that support for cgroup v2 can be deferred until we actually have a machine using it.

@andreas-kupries
Contributor Author

If you can agree to the deferral of cgroup v2 support the PR should be ready again.

@andreas-kupries
Contributor Author

... More information ... Thanks to feedback from research ... It is possible to forcibly boot a system into something using cgroup v2 via the cgroup_no_v1=all kernel parameter.

Doing that with Ubuntu (to keep the work box up) the file memory.swap.current appears. It contains the used swap, in bytes.
Finding the proper cgroup requires looking for *::/path/to/cgroup ... I.e. the memory key was not present.
Note, swap accounting was active at that point.

I suspect that memory.swap.current will disappear when I disable swap accounting again.
Confirmed. Oh, and realized that the memory path segment went away also. That looks to be a v1 thing too.

It should be possible to add the v2 support with that.

@andreas-kupries
Contributor Author

@klakin-pivotal @jandubois Added support for cgroup v2 now. Reorganized the internals a fair bit.

The data extraction functions for the main pieces all have the prefix determine now. Where v2 and v1 needed to be supported these functions look for the v2 files first, then the v1 files, and extract from whatever is found. The higher layer just calls these functions and does not care about v2 versus v1.
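
For illustration, a sketch of that v2-first, v1-fallback shape (not the actual PR code): readUintFromFile is a hypothetical helper, and memory.max / memory.limit_in_bytes are the v2/v1 limit files per the kernel docs:

package sigar

import "path/filepath"

// Sketch of the "determine*" shape described above: try the cgroup v2 file
// first, fall back to the v1 file, and report whether anything was found.
// readUintFromFile is a hypothetical helper reading one integer from a file
// (and failing on v2's literal "max", which means "no limit").
func determineMemoryLimit(v2Dir, v1Dir string, readUintFromFile func(string) (uint64, error)) (uint64, bool) {
	if limit, err := readUintFromFile(filepath.Join(v2Dir, "memory.max")); err == nil {
		return limit, true // cgroup v2
	}
	if limit, err := readUintFromFile(filepath.Join(v1Dir, "memory.limit_in_bytes")); err == nil {
		return limit, true // cgroup v1
	}
	return 0, false // no cgroup data; the caller falls back to /proc
}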

The tests got extended to cover these new extraction functions, and the v2 support.
This got reorganized also, with a number of new helper functions for setting up the context to be tested.
The actual test cases are now calling on these helpers. This made them short and readable.

Please review.

@klakin-pivotal
Contributor

klakin-pivotal commented Apr 20, 2021

I haven't run the tests, but I like your code so far. However the code has one major issue, and I've a few minor quibbles.

The major issue:
Both the cgroup v1 and v2 controllers can be mounted anywhere on the filesystem. Your code assumes that they'll be in /sys/fs/cgroup/memory and /sys/fs/cgroup/unified, which are good assumptions, but you're gonna need some fallback code to search for them if you can't find what you're looking for in either one.
Finding the v2 cgroup in /proc/self/mounts is easy. Find the one entry with a type of cgroup2, grab the path to the mount point, and use it as your controller root. I'd strongly consider raising a fatal error if you find more than one entry... but I don't know the proper mechanism to raise a fatal error in the gosigar project.
Finding the v1 memory controller is quite a bit more annoying. You can't rely on memory being anywhere in the mount point; you have to find the line with a type of cgroup and a mount option of memory. Again, if you find more than one, I'd consider that to be a fatal error.

(And yeah, I spent like half a day trying to figure out if there was a better way to find the v1 memory cgroup controller. If you can find a better way than that, please do let me know!)
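
For illustration, a sketch of that search over /proc/self/mounts; the function name is invented, and duplicates are reported with a plain error here rather than whatever fatal-error mechanism gosigar would use:

package sigar

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Sketch: locate the cgroup v2 mount and the v1 memory-controller mount by
// scanning /proc/self/mounts. Per-line field layout: device, mount point,
// fstype, options, dump, pass.
func findCgroupMounts() (v2Root, v1MemRoot string, err error) {
	f, err := os.Open("/proc/self/mounts")
	if err != nil {
		return "", "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 {
			continue
		}
		mountPoint, fsType, options := fields[1], fields[2], fields[3]
		switch fsType {
		case "cgroup2":
			if v2Root != "" {
				return "", "", fmt.Errorf("multiple cgroup2 mounts found")
			}
			v2Root = mountPoint
		case "cgroup":
			// v1: the memory controller is identified by its mount option,
			// not by anything in the mount point path.
			for _, opt := range strings.Split(options, ",") {
				if opt == "memory" {
					if v1MemRoot != "" {
						return "", "", fmt.Errorf("multiple v1 memory controller mounts found")
					}
					v1MemRoot = mountPoint
				}
			}
		}
	}
	return v2Root, v1MemRoot, scanner.Err()
}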

The minor quibbles:
It turns out that that 8EiB number that we're seeing in memory.limit_in_bytes is 2^63 - 4096. A month or so ago, I remember reading that this is the maximum limit of the Linux virtual memory system.
However, since I can't dig up any docs that agree with my memory, would you be willing to mention in your comment here https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R121 that that 8EiB number is 2^63 - 4096?

What's the deal with this commented-out code? Should it be removed, un-commented and worked on? https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R410-R412

Is it safe to not check the error return of strtoull here: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R393? If I'm reading the code correctly, the caller is throwing away any error object that it generates: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R151

(I also notice that you are checking the error return of determineMemoryUsage: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R145)

I'm fairly certain that the "global controller" should be called the "cgroup v2 memory controller": https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R470

Thanks for sticking with this.

@andreas-kupries
Contributor Author

I haven't run the tests, but I like your code so far. However the code has one major issue, and I've a few minor quibbles.

The major issue:
Both the cgroup v1 and v2 controllers can be mounted anywhere on the filesystem. Your code assumes that they'll be in /sys/fs/cgroup/memory and /sys/fs/cgroup/unified, which are good assumptions, but you're gonna need some fallback code to search for them if you can't find what you're looking for in either one.

Commit c7bedf3 now searches by default, and uses the conventional locations as the last-resort fallback.

Finding the v2 cgroup in /proc/self/mounts is easy. Find the one entry with a type of cgroup2, grab the path to the mount point, and use it as your controller root. I'd strongly consider raising a fatal error if you find more than one entry... but I don't know the proper mechanism to raise a fatal error in the gosigar project.

I decided to go with panic.

Finding the v1 memory controller is quite a bit more annoying. You can't rely on memory being anywhere in the mount point; you have to find the line with a type of cgroup and a mount option of memory. Again, if you find more than one, I'd consider that to be a fatal error.

Ditto panic.

(And yeah, I spent like half a day trying to figure out if there was a better way to find the v1 memory cgroup controller. If you can find a better way than that, please do let me know!)

The minor quibbles:
It turns out that that 8EiB number that we're seeing in memory.limit_in_bytes is 2^63 - 4096. A month or so ago, I remember reading that this is the maximum limit of the Linux virtual memory system.
However, since I can't dig up any docs that agree with my memory, would you be willing to mention in your comment here https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R121 that that 8EiB number is 2^63 - 4096?

Done, commit 7ef1556

What's the deal with this commented-out code? Should it be removed, un-commented and worked on? https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R410-R412

Done, removed. Part of commit 86c8dc5

Is it safe to not check the error return of strtoull here: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R393? If I'm reading the code correctly, the caller is throwing away any error object that it generates: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R151

Commit 86c8dc5
While the error is still thrown away, this is now done in the caller, with an explanation of why.

(I also notice that you are checking the error return of determineMemoryUsage: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R145)

Well, memory usage is a required thing. Missing swap usage, on the other hand, we can work around.

I'm fairly certain that the "global controller" should be called the "cgroup v2 memory controller": https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R470

Commit 49c490a

@klakin-pivotal
Contributor

Okay, I looked over (and ran) the tests. I think we have three things left before I can attempt to merge this in and make a release:

  1. The code that this test exercises seems incorrect: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-833f667aa1788f36437012c5b86b41654b76da4b46faa0e48c4f44b932645690R217-R223

The v2 cgroup controller will -apparently- always be controller id 0, as mentioned in the official docs: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#organizing-processes-and-threads.

The entry for cgroup v2 is always in the format “0::$PATH”

While searching for entries with a blank "controller name" part does currently find the v2 controller, it is best to stick with the documented contract (a small sketch of matching the documented prefix follows the list below).

  2. If reasonably possible, please rebase on master and squash these commits down to a few commits that make sense.
  3. If you're interested, do please give me a list of people in addition to yourself to add to an Acknowledgements section in the release notes. (If you'd rather I not add such a section, I can do that, too.)
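
For example, a minimal check along those documented lines, as a sketch (not the PR's code); it assumes line holds an entry from /proc/self/cgroup:

package sigar

import "strings"

// Sketch: per the cgroup v2 docs the v2 entry in /proc/self/cgroup is always
// "0::$PATH", so match the documented "0::" prefix instead of testing for a
// blank controller-name field.
func v2CgroupPath(line string) (string, bool) {
	const prefix = "0::"
	if !strings.HasPrefix(line, prefix) {
		return "", false
	}
	return strings.TrimPrefix(line, prefix), true
}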

Thanks much for the PR!

@klakin-pivotal
Contributor

klakin-pivotal commented Apr 23, 2021

I apologize, I spoke a little bit too soon.

These lines have a bug: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R62-R64

Given the following /proc/mounts

<...irrelevant lines snipped...>
memory /cgroup2/memory cgroup rw,memory 0 0

the code will -as written- fail to find any cgroup controllers and will fall back to the system's top-level numbers. That first column in /proc/mounts (the device column) can be pretty much anything you want it to be when you're mounting a pseudo-filesystem like the cgroup FSes. I set mine up like that by doing:

mkdir /cgroup2/memory
mount -t cgroup memory /cgroup2/memory -o memory

My horrifying code to correctly handle this is to replace line 62 with this:

if !(strings.HasPrefix(strings.Fields(line)[2], "cgroup")) {

Hopefully there's a way to do that which is easier to read.
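
One possibly easier-to-read variant, as a sketch meant to slot into the existing loop over the /proc/mounts lines (it assumes strings is imported and line holds the current entry):

// Sketch: split the line once and compare the filesystem type field exactly,
// accepting both cgroup v1 ("cgroup") and v2 ("cgroup2") mounts.
fields := strings.Fields(line)
if len(fields) < 3 {
	continue // malformed line, skip it
}
if fields[2] != "cgroup" && fields[2] != "cgroup2" {
	continue // not a cgroup mount, skip it
}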

Anyway, once this gets ironed out, and the three things from my previous comment are resolved, then we are well and truly good to go.

Additional background information

systemd tends to set the device name to the type of the pseudo-filesystem. A non-systemd Gentoo Linux system sets the device name to the name of the controller (or unified) for the cgroup2 FS type. Any user can -as we've seen- set the device name to whatever they'd like it to be.

@andreas-kupries
Contributor Author

These lines have a bug: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R62-R64

This code is just a precheck I added to avoid the full check with splitting line into fields for all entries.
In hindsight, a premature optimization which became a wrong optimization.

I will drop this code completely.
I will also place this into a function and then write tests for it, given that we cannot test the init() part of the package.

@andreas-kupries
Contributor Author

The code that this test exercises seems incorrect: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-833f667aa1788f36437012c5b86b41654b76da4b46faa0e48c4f44b932645690R217-R223

Done, locally. No commit, will be squashed with the coming rebase anyway.

@andreas-kupries
Contributor Author

List of people for an ack section:

IMHO @jandubois

…ble information.

The standard API for memory limits and usage under /proc are not
containerized, and return host data when used in a container. This is
a source of trouble for everything trying to configure itself based on
available memory, causing it to work with bogus data.

Inside containers the cgroup information rules instead.

Trying to detect when running inside a container or not looks to have
many approaches, which are either unreliable, or non-portable, or both.

The approach here is to query both sets of data, and then fuse them for
accuracy. As we ask for the available memory taking the min of cgroup and
proc data is the safe choice for this fusion.

It assumes that inside a container the cgroup data is likely less than
the proc data, due to the container getting limited. And outside the
container the cgroup data should not be less than proc.

(That said, on the local box, which has 36GiB of memory, cgroup reports
a limit of nearly 8EiB (yes, exa; the exact value is 8EiB-4KiB, as far
as is known the absolute limit of Linux virtual memory management).

As proc reports the 36GiB correctly, the min is again the proper choice
for systems where cgroup on the host is that far out of whack.)

As for containers without limits set by docker or other runtime, cgroup
data seems to be that of the host. The min-fusion again gives the best
possible result, i.e. the host memory.

Note that getting the cgroup information is convoluted.

  - We have v1 and v2 controllers. Both can be mounted, some mixture,
    or neither.

  - The location of the mounts is arbitrary.

  - The process the library is part of is likely not managed by the
    root cgroup. The actual controlling group is needed.

  - Swap usage is not present if swap accounting is not active.

The gory details are noted in code comments.

Added tests.
Added clean up of the test's temp directory.
Added helpers for most of the mock file setup, for more readable tests.
@andreas-kupries andreas-kupries force-pushed the ak-gosigar-48-kubecf-1312-mem-limits branch from 7f8bc4b to 1800af1 on April 23, 2021 09:33
@andreas-kupries
Contributor Author

Actions:

  • Fixed cgroup v2 extraction, now match 0::/path instead of *::/path. v1 still matches *:memory:/path.
  • Removed bogus gating check in the code extracting the controller mount points
  • Moved that code into a function called by init(), and added tests for the function.
  • Rebased to latest origin/master and squashed history.
  • Ack section: Ok with it, and adding @jandubois

@jandubois

List of people for an ack section:

IMHO @jandubois

idk, I didn't do any actual work on the PR beyond nudging people so it doesn't get forgotten. I rather think @mudler should be mentioned because he did all the initial debugging to determine that this was the cause of the log-cache failures when containerized (see #48).

@andreas-kupries
Contributor Author

List of people for an ack section:
IMHO @jandubois

idk, I didn't do any actual work on the PR beyond nudging people so it doesn't get forgotten. I rather think @mudler should be mentioned because he did all the initial debugging to determine that this was the cause of the log-cache failures when containerized (see #48).

👍 I had not remembered that @mudler was involved at the beginning.

@klakin-pivotal
Contributor

Looks good except for one last thing. Your mount-option-processing code here: https://github.com/cloudfoundry/gosigar/pull/50/files#diff-b9503745e5e2cb945eb27d56d7b4706d2cd33b9f6542f8231818c07d43f76225R659 does a substring match on the mount options for memory. To avoid false positives, it should either

  • Create a list out of the comma-separated mount options and search for the memory mount option (a sketch of this approach follows the list), or
  • Search for all possible substrings for the memory mount option (,memory, ^memory, ^memory$, ,memory$).
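
A minimal sketch of the first option; the helper name is invented:

package sigar

import "strings"

// Sketch: split the comma-separated mount option list and require an exact
// match, so that e.g. "memory2" or "xmemory" cannot match by accident.
func hasMountOption(options, want string) bool {
	for _, opt := range strings.Split(options, ",") {
		if opt == want {
			return true
		}
	}
	return false
}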

Avoid trigger on larger strings having the target as substring
@andreas-kupries
Contributor Author

Done. Apologies for the delay. Saw the comment and then promptly forgot due to other work. Remembered today.

@klakin-pivotal klakin-pivotal merged commit e52716b into cloudfoundry:master Apr 30, 2021
@klakin-pivotal
Contributor

The PR has been merged to the master branch and a Github Release has been created: https://github.com/cloudfoundry/gosigar/releases/tag/v1.2.0

Thanks much for all the work!

Do let me know if there's anything more that I need to do to make this release available for others to use.

@jandubois

Do let me know if there's anything more that I need to do to make this release available for others to use.

Thanks a lot for your diligent review! What we are really looking for now is a log-cache release that uses the updated gosigar. 😄 If you know somebody on that team, please ping them. Otherwise I'll try to find them next week on Slack...

@andreas-kupries andreas-kupries deleted the ak-gosigar-48-kubecf-1312-mem-limits branch May 3, 2021 07:20
ystros pushed a commit that referenced this pull request Sep 9, 2021
In PR #50, the behavior of GetMem() was changed to return the amount of
memory used / available within the current cgroup. There are some use
cases in which you don't want cgroup-constrained memory information. A new
GetMemIgnoringCGroups method has been introduced to provide the
system-wide memory information.

[#179517901] Update Bosh Agent to correctly report memory used in `bosh vms --vitals` and related code paths, and publish new stemcells

Signed-off-by: Brian Upton <bupton@vmware.com>