ARC not accounted as MemAvailable in /proc/meminfo #10255
The ZFS ARC cache is memory-reclaimable, like the Linux buffer cache. However, in contrast to the buffer cache, it currently does not count towards `MemAvailable` (see openzfs/zfs#10255), leading earlyoom to believe we are out of memory when we still have a lot of memory available (in practice, many GBs). Thus, until now, earlyoom tended to kill processes on ZFS systems even though there was no memory pressure. This commit fixes it by adding the `size` field of `/proc/spl/kstat/zfs/arcstats` to `MemAvailable`. The effect can be checked easily on ZFS systems: before this commit, dropping the ARC via `echo 3 | sudo tee /proc/sys/vm/drop_caches` (command from [1]) would result in an increase of free memory in earlyoom's output; with this fix, it stays equal. [1]: https://serverfault.com/a/857386/128321
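For readers who want to see the arithmetic that commit message describes, here is a rough shell sketch of the same adjustment (it assumes the usual name/type/data column layout of `arcstats`; it is an illustration, not the earlyoom code itself):

```sh
#!/bin/sh
# Sketch: a ZFS-aware "available memory" figure, computed as the commit
# message describes -- MemAvailable plus the current ARC size.
mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
arc_size_b=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)
echo "MemAvailable:           ${mem_avail_kb} kB"
echo "ARC size:               $((arc_size_b / 1024)) kB"
echo "ZFS-aware MemAvailable: $((mem_avail_kb + arc_size_b / 1024)) kB"
```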
I've encountered the same problem with qemu/kvm. Despite ARC being cache-like, it would not free memory when qemu/kvm requested RAM. |
Another relevant question: even if you did want to update every single application instead, by what value would you have to adjust? That is what I am trying to figure out. |
This is something we can take another look at. When this code was originally integrated with the kernel there were a variety of technical issues which prevented us from reporting the ARC space as page cache pages. Since then the Linux kernel has changed considerably, as has ZFS, so this may now be more feasible and is definitely worth re-investigating. |
@behlendorf shouldn't this issue be prioritized / be considered as a blocker for the ZFS 2.0 release? |
I've also run into this issue booting KVM VMs. It's pretty confusing to have them fail to boot for a reason completely unrelated to KVM. As a workaround, I manually dropped the size of the ARC cache and immediately expanded it again after the VM had allocated its memory. |
Probably my own fault, but due to limited hardware my backup ZFS server is an old laptop with only 8 GB total system RAM, which is shared with the integrated GPU. ZFS is the only thing that can run on it after a few minutes, because other programs do not recognize the utilized memory as the cache memory it is. This causes every program to behave as if my system genuinely has 7.5/7.8 GB utilized. Quite annoying, and it will likely only get worse if I make a larger pool (I assume).

Not so bad on my main server, since it has more memory and also does not really run any graphical applications given it's headless. My backup machine, though, is airgapped to protect from power surges when I'm away and, due to its location, has terrible networking. It uses its graphical interface to read incremental backup data off the shuttle drives I use.

If this issue is fixed at some point, that would make my boss reconsider using ZFS on our servers, given how much he likes the snapshots and scrubbing functionality. Until the memory issues can be dealt with (ideally without capping the ARC size), the senior admin will lean heavily on using hardware RAID on Windows Server. Not exactly a bad solution, but I think ZFS has some advantages over traditional RAID, and that Linux and UNIX both make decent hosts for virtualisation. Ideally I would be planning to deploy ZFS as my boot partition, or at least for my home and manually installed application folders, once an acceptable solution to the RAM usage thing is found. Personally I don't want to limit the ARC size, since I like the idea of using as much of my free RAM as possible as a cache, just as long as my applications see it as cache and act normally. |
To be fair, this is not a problem the OpenZFS project has any say over. If they are the type who feel fine running Windows Server on a laptop with 8GB RAM (and somehow capping the ARC is not a solution?) then there's no problem. They seem to be satisfied with Windows Server, and it meets their needs. |
It's me using the laptop, due to limited funds... that, and it has only one job: store my backups. They're fine using no ZFS and Windows Server (or real servers) because that is what the entire firm is familiar with. I will be trying the capping of the ARC size myself, but personally I like a large cache in RAM just as long as the applications sharing that RAM see it as such. Other than my edge case, ZFS is a really good system; on my main machine, which has a reasonable amount of memory, ZFS has been the best decision I ever made, due to an unstable CPU causing crashes and ZFS so far being the most fault-tolerant solution I have used to date. They do seem to have issues though, since they are considering giving me a Linux VM to try one of their services on to see if we could deploy Linux in the future; time will tell if the small Linux VM will be convincing. |
Apologies for the noise, but has anyone had a chance to take another look at this? |
If you found this like I did—because Firefox was unloading tabs thinking there wasn't enough memory—then you may find the
I set mine to 3gb with |
Thanks for that one, @KenMacD. Using that parameter seems a bit more flexible than

As a side note, for people who want to make these changes permanent, follow the instructions here: https://www.cyberciti.biz/faq/how-to-set-up-zfs-arc-size-on-ubuntu-debian-linux/ Just don't forget to update the initramfs, like I did. 🙄 |
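The runtime side of this is just a sysfs write; a minimal sketch, using `zfs_arc_max` purely as an example tunable (the specific parameter referred to above is not preserved in this thread) and 3 GiB as the value:

```sh
# Runtime-only change (lost at reboot); 3 GiB expressed in bytes.
echo $((3 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max
```

For a persistent setting, the guide linked above puts the equivalent `options zfs ...` line into a modprobe config and regenerates the initramfs.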
A solution here would be great. Props to htop for reporting sensible free memory (it counts shrinkable ARC as cache). My latest run-in is compiling UnrealEngine and getting nerfed parallel actions
where in reality >40GB is available. |
I'm running NixOS + ZFS on a Framework 13 AMD laptop and have been dealing with random crashes for 2 weeks (I didn't know what was causing them). I just now saw in journalctl that it was earlyoom (I even submitted a bug report to vscode, lol). Is anyone currently actively working on this issue? If so, is there anything I can do to help? If not, I would love to dig in and try to fix this bug. Does anyone have any tips as to what docs / code I should read first? |
Is a solution even technically possible, or is it in the hands of the kernel? |
Unfortunately it's not about where things are accounted to, as such. We don't get to choose that. Linux accounts a certain type of memory through MemAvailable (roughly, "page cache"). That's not the type of memory the ARC uses, for both historical and practical reasons.

It's unclear if it's possible to convert the ARC to use that kind of memory. I think at least a prototype is possible, but it's an enormous job, and I don't know if all the niches are possible. I wouldn't do it just to change where the memory is accounted to, though; the main reason for me would be to reduce copies on mmap-heavy workloads. It's something I'd like to try, but I'm extremely unlikely to get to it any time soon.

I gotta say though, I am slightly inclined to point the finger at earlyoom here, because it's not implementing the same logic as the true OOM killer. OpenZFS registers cleanup callbacks with the kernel; when it's under memory pressure, the kernel will ask OpenZFS to make memory available, which it will do. If earlyoom is basing its kill decisions solely on kernel memory stats, then it could be argued that it's not precise enough. It seems like it might be straightforward to patch it to include ARC size in its accounting. Has anyone asked about this? I didn't see an issue on the tracker. |
As mentioned earlier in this issue, it's not just earlyoom, nor even every early-OOM daemon, that you'd need to patch for specific ARC handling. |
There was earlyoom PR 191, but it was closed due to lack of testing. oomd makes use of PSI (pressure stall information). We make use of earlyoom with custom ZFS-awareness patches, but would prefer everything to be happy upstream. Sure, projects can point to other projects doing the Wrong Thing, but the definition of MemAvailable is
When ARC is (a) responsive to memory pressure and (b) not accounted for in MemAvailable, it's hard for userspace to determine system health. "Should I trigger an OOM kill?", "Is there enough memory to launch X?", and "Should Firefox unload this tab?" all tend to rely on the MemAvailable abstraction, and "fixing" one app doesn't help the others. Would it be feasible to hook into the MemAvailable accounting (as suggested by @behlendorf, April 2020)? |
I don't disagree that having it all in one place would be easier for tools that don't know the difference. What I'm saying is that it's hard to do, maybe impossible. Wanting it more doesn't change that. |
Any update on when this can be addressed? I think this is causing my BeeLink GTR7PRO (96 GB RAM, running Proxmox 8.2.2 with ZFS) to crash about once a week when doing a VM backup.
Logs right before the Proxmox system's kernel panicked:
`zpool status -v` returned errors around the virtual disk file for this VM (102). I was able to reboot and repair the pool and then start the VM. |
@sjau Did you ever resolve your issue(s) with qemu/kvm requesting RAM? |
Any idea if there are more discussions on low-level ZFS and Linux integration? @behlendorf |
My solution was getting more RAM ;) |
I recently had a server running Proxmox kill a VM due to an OOM. It was very quick though: none of my containers or VMs showed a memory increase, and my graphs only show a sharp increase in ARC usage right before the OOM at 4:10 AM. We can also see that the ARC was larger than before the OOM, which indicates it wasn't being constrained or freed properly. This did occur during a regular scrub, but the scrub was started hours before. This is all with ZFS

Looking into this, I found a note at https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage which indicated that Proxmox now limits ARC memory more aggressively by default on new installs, to no more than 10% of installed RAM or 16GB. I've done the same, but honestly this feels like a failure in the ARC that's just being masked. |
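The default described on that wiki page amounts to a small calculation; a sketch of the same arithmetic (this is an illustration, not the actual Proxmox installer logic):

```sh
# Sketch: compute min(10% of installed RAM, 16 GiB) in bytes -- the cap
# described above -- and print the matching modprobe option line.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
cap=$(( total_kb * 1024 / 10 ))
limit=$(( 16 * 1024 * 1024 * 1024 ))
[ "$cap" -gt "$limit" ] && cap=$limit
echo "options zfs zfs_arc_max=$cap"
```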
I created an /etc/modprobe.d/zfs.conf file and went with an 8 GB ZFS ARC max.
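The comment doesn't quote the file itself; its contents would presumably look something like this (8 GiB expressed in bytes), applied via the initramfs update and reboot described below:

```
# /etc/modprobe.d/zfs.conf -- cap the ARC at 8 GiB (8 * 1024^3 bytes)
options zfs zfs_arc_max=8589934592
```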
Before this change, here are the ARC stats; I definitely do not need 48 GB of RAM (c_max) going to a ZFS RAID 1 mirror (2 x 2 TB Samsung 990 Pro NVMe Gen4 drives).
After `update-initramfs -u -k all` and a reboot with the new config file, c_max now shows 8 GB. |
Thinking that a proper workaround would be to configure the zfs_arc_max value and update a corresponding Linux boot parameter to set the max RAM to physical memory minus this zfs_arc_max amount? I'm researching how to set max memory for Proxmox on Debian 12. @behlendorf - Is this idea possible - would limiting the Linux RAM work, or would there still be a memory-type difference issue? |
@ThomasLamprecht How does Proxmox handle this scenario? My server started off as Proxmox 8.0 and upgraded via

I think my server was crashing when VM(s) were consuming memory (I have an AI VM that has a large language model, so it goes from using 1.5 GB RAM to over 64 GB RAM) and a VM backup occurred at the same time; Linux then crashed with a memory fault because it was not aware of this ZFS ARC memory? I wonder if there is also a use case for Proxmox to change the default zfs_arc_sys_free configuration. |
Yeah, we strictly limit setting up our dynamic (lower) default ARC size to installation time, as we do not want to mess with existing systems; we cannot really differentiate whether an admin explicitly wants ZFS to use 50% or not.
Sudden memory usage spikes can indeed result in OOM if the memory cannot be reclaimed fast enough, albeit I'm not sure about the specifics w.r.t. the interaction between the kernel and ZFS here off the top of my head.
Not so sure about that, as the requirements for that value are very dependent on the actual use case. E.g., if VMs, or other huge-memory workloads like your AI tool, are frequently started and stopped again, it could be good to reserve the fluctuation by using

So, at least for Proxmox VE, this is IMO something that should be handled by the administrators for their specific use case. |
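For illustration, the reservation idea maps onto the `zfs_arc_sys_free` tunable asked about above; a sketch with an arbitrary 16 GiB value (pick the number from your own workload's memory swings):

```sh
# Sketch: ask the ARC to try to keep ~16 GiB of system memory free,
# leaving headroom for sudden allocations such as VM or LLM start-up.
echo $((16 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_sys_free
```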
Update - After I configured /etc/modprobe.d/zfs.conf, my GTR7Pro AMD-based Proxmox hypervisor has not crashed or hung. VMs and their applications have been running without issue. However, most backups of my main VM fail with an error. If I move this main VM to completely different hardware (an older Intel host), then it backs up successfully and runs without issue. Today, I updated from Proxmox 8.2.2 to 8.2.4. In initial testing, I was able to successfully back up the main VM without an error after this update. I'll report back in a week to see if my issue is truly resolved. Upgrade: |
Update: I have not had any ZFS-related errors since updating to Proxmox v8.2.4. @ThomasLamprecht Great work on Proxmox v8.2.4; I really appreciate your team. The system log has been quiet for the last 4 days: |
I've noticed the ARC getting really large when I create a lot of Docker containers, which eventually causes OOM issues with Ubuntu 24.10. My workaround is to have this in my

```sh
# Set zfs_arc_max to half currently free memory or 8 Gb, whichever is lower.
half_free_mem_now="$(free -bL | awk '{print $8/2}')"
echo "Half free mem: $((half_free_mem_now / 1073741824)) Gb"
[[ $half_free_mem_now -lt 8589934592 ]] && echo "${half_free_mem_now}" > /sys/module/zfs/parameters/zfs_arc_max || echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
echo "zfs_arc_max => $(($(cat /sys/module/zfs/parameters/zfs_arc_max) / 1073741824)) Gb"
zfs_arc_size="$(awk '/^size/ { print $1 " " $3 / 1073741824 }' /proc/spl/kstat/zfs/arcstats)"
echo "ZFS ARC ${zfs_arc_size} Gb"
```

If I find I need to dramatically reduce the ARC, I also run this:

```sh
sudo sync && echo 2 | sudo tee /proc/sys/vm/drop_caches
```

I've also been using `smem`. I added this to my

```sh
zfs_arc_size="$(awk '/^size/ { print $1 " " $3 / 1073741824 }' /proc/spl/kstat/zfs/arcstats)"
echo "ZFS ARC ${zfs_arc_size} Gb"
smem -t -k -w
```

This gives me a good sense of how things are:

```
ZFS ARC size 3.78197 Gb
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory          6.0G     869.9M       5.2G
userspace memory               6.3G       2.0G       4.4G
free memory                   33.3G      33.3G          0
----------------------------------------------------------
                              45.6G      36.1G       9.5G
```

The ARC is in |
@satmandu You haven't specified which ZFS version you are using, but in 2.3 I've made a few changes to make ZFS more responsive to kernel memory requests. It should also report reclaimable memory to the kernel, which might be visible to some external things, like memory balloon drivers, etc. |
Apologies! I am using OpenZFS 2.3.0-rc2, from which I am generating a PPA here: https://launchpad.net/~satadru-umich/+archive/ubuntu/zfs-experimental The issues cropped up after updating to Ubuntu 24.10. I wasn't having issues with the 2.3.0 rc series on Ubuntu 24.04, I think. (And this is on either 6.11.x or 6.12.0-rc kernels.) |
@satmandu Try to set |
@amotin I set that; thank you for the suggestion:

```
ZFS ARC size 33.2544 Gb
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory         36.1G       3.6G      32.5G
userspace memory               5.3G     582.8M       4.7G
free memory                    4.2G       4.2G          0
----------------------------------------------------------
                              45.6G       8.4G      37.2G
```
|
@shodanshok ^^^ |
Can I ask if they are "true" kernel OOM (with #16313 should ignore |
I believe that earlyoom is not even installed here:

```
sudo apt remove earlyoom
Package 'earlyoom' is not installed, so not removed
Summary:
  Upgrading: 0, Installing: 0, Removing: 0, Not Upgrading: 1
```

Sample from dmesg when I was having these issues. Happy to disable |
@shodanshok without

From the current dmesg:
@satmandu thanks for your tests. From the log, I can see all OOM are due to
It is somewhat surprising, as legacy kernel behavior was to try direct reclaims when
As I understand it, this means that if

If you are willing to do some more tests:
Thanks. |
Looks like I do have it enabled:

```
cat /sys/kernel/mm/lru_gen/enabled
0x0007
```

I'll see what happens when I disable it... |
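For reference, MGLRU can be toggled at runtime through that same sysfs file; a sketch based on the kernel's multi-gen LRU documentation (behavior may vary by kernel version):

```sh
# Disable the multi-gen LRU at runtime (does not persist across reboots)...
echo n | sudo tee /sys/kernel/mm/lru_gen/enabled
# ...and re-enable it afterwards.
echo y | sudo tee /sys/kernel/mm/lru_gen/enabled
```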
Both disabling MGLRU with |
This is the first ZFS pool I've ever created, and I've hit the problem consistently while trying to move my data from root to a dataset. I'm still getting used to how ZFS works, but that was very unsettling, because the OOM killer killed my cp process running via ssh. After setting |
Scratch that. Just got oomkilled again. Attempting with MGLRU disabled |
And no dice. Got OOM-killed even faster. I kind of have a feeling it's not respecting my ARC size... edit: I'm on 2.2.6 btw, so maybe I'll have to wait for 2.3 for things to get better? I'm actually surprised by how badly it performs out of the box for me, since ZFS has a reputation of being very stable (unlike, say, btrfs, which is where I'm coming from). |
I eventually ended up with a lockup when I had

I switched to |
Describe the problem you're observing
As described e.g. here, ARC does not count the same way as the Linux buffer cache (e.g. yellow in `htop`), and ZoL does not use the buffer cache; ARC shows up as green in `htop`. More importantly, ARC does not count as `MemAvailable` in `/proc/meminfo`. This interacts badly with various memory-related tools on Linux.

Most prominently, this affects software like `earlyoom` (and possibly Facebook's similar `oomd`) that can be used to avoid hangs due to inefficiencies in Linux's OOM killer that result in ~20 minute system stalls (recent news coverage e.g. here). I observe this myself daily, when programs get killed by `earlyoom` even though GBs could (and would) be available from ARC, and others have also noticed. This puts ZoL and its users at a disadvantage and forbids its usage in some cases.
Describe how to reproduce the problem
Run `earlyoom` on a desktop system with ZoL, then drop the ARC via `echo 3 | sudo tee /proc/sys/vm/drop_caches` and observe that the available memory reported by `earlyoom` increases.
It seems the best solution would be to count reclaimable ARC memory in to the kernel's
MemAvailable
statistic. This would be better than having to adjust every single application to read e.g./proc/spl/kstat/zfs/arcstats
and do computations itself.The text was updated successfully, but these errors were encountered: