
ARC not accounted as MemAvailable in /proc/meminfo #10255

Open
nh2 opened this issue Apr 25, 2020 · 46 comments
Labels
Component: Memory Management

Comments

@nh2

nh2 commented Apr 25, 2020

Describe the problem you're observing

As described e.g. here, the ARC is not accounted for in the same way as the Linux buffer cache (shown in yellow in htop), because ZoL does not use the buffer cache; the ARC instead shows up as green ("used") memory in htop.

More importantly, the ARC is not counted toward MemAvailable in /proc/meminfo.

This interacts badly with various memory-related tools on Linux.

Most prominently, it affects software like earlyoom (and possibly Facebook's similar oomd), which is used to avoid hangs caused by inefficiencies in Linux's OOM killer that can result in ~20-minute system stalls (recent news coverage e.g. here). I observe this myself daily: programs get killed by earlyoom even though GBs could (and would) be made available from the ARC, and others have noticed it as well.

This puts ZoL and its users at a disadvantage and rules out its use in some cases.

Describe how to reproduce the problem

  • Install earlyoom on a desktop system with ZoL.
  • Do some heavy IO, then some heavy browsing with many tabs, observe some of them getting killed.
  • Observe that this stops when you do echo 3 | sudo tee /proc/sys/vm/drop_caches.

Suggested solution

It seems the best solution would be to count reclaimable ARC memory into the kernel's MemAvailable statistic. This would be better than having to adjust every single application to read e.g. /proc/spl/kstat/zfs/arcstats and do the computation itself.
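For illustration, this is roughly the computation every such application would otherwise have to perform on its own (a minimal shell sketch; treating size minus c_min as the reclaimable ARC portion is an assumption, not an official formula):

# Sketch: approximate MemAvailable including reclaimable ARC (result in kB).
mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
arc_size=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)    # current ARC size, bytes
arc_c_min=$(awk '$1 == "c_min" {print $3}' /proc/spl/kstat/zfs/arcstats)  # floor the ARC will not shrink below, bytes
echo "adjusted MemAvailable: $(( mem_avail_kb + (arc_size - arc_c_min) / 1024 )) kB"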

nh2 added a commit to nh2/earlyoom that referenced this issue Apr 25, 2020
The ZFS ARC cache is memory-reclaimable, like the Linux buffer cache.
However, in contrast to the buffer cache, it currently does not count
to `MemAvailable` (see openzfs/zfs#10255),
leading earlyoom to believe we are out of memory when we still have
a lot of memory available (in practice, many GBs).

Thus, until now, earlyoom tended to kill processes on ZFS systems
even though there was no memory pressure.

This commit fixes it by adding the `size` field of
`/proc/spl/kstat/zfs/arcstats` to `MemAvailable`.

The effect can be checked easily on ZFS systems:

Before this commit, dropping the ARC via (command from [1])

    echo 3 | sudo tee /proc/sys/vm/drop_caches

would result in an increase of free memory in earlyoom's output;
with this fix, it stays equal.

[1]: https://serverfault.com/a/857386/128321
@sjau

sjau commented Apr 26, 2020

I've encountered the same problem with qemu/kvm. Despite the ARC being cache-like, it would not be freed when qemu/kvm requested RAM.

@nh2
Author

nh2 commented Apr 28, 2020

Another relevant question:

Even if you did want to update every single application instead, by what value would you have to adjust MemAvailable?

I am trying + size - c_min - arc_meta_min, but there is still an unaccounted-for 300-500 MB difference between that and what e.g. htop shows, which does not seem to exist on a similar ext4 system.

nh2 added a commit to nh2/earlyoom that referenced this issue Apr 28, 2020
nh2 added a commit to nh2/earlyoom that referenced this issue Apr 28, 2020
@behlendorf added the Component: Memory Management label Apr 28, 2020
@behlendorf
Contributor

This is something we can take another look at. When this code was originally integrated with the kernel there were a variety of technical issues which prevented us from reporting the ARC space as page cache pages. Since then the Linux kernel has changed considerably, as has ZFS, so this may now be more feasible and is definitely worth re-investigating.

nh2 added a commit to nh2/earlyoom that referenced this issue May 1, 2020
@LifeIsStrange

LifeIsStrange commented Nov 26, 2020

@behlendorf shouldn't this issue be prioritized / be considered a blocker for the ZFS 2.0 release?
Rationale: keeping a system from becoming unusable or freezing under low RAM availability / heavy swapping is a real benefit (and catches up with Windows), one that many distros are currently pursuing.
Fedora has integrated earlyoom by default, many end users use nohang, and most importantly systemd is integrating oomd: https://www.phoronix.com/scan.php?page=news_item&px=Systemd-247-Lands-OOMD
I am not an expert, but shouldn't this issue be fixed before such system-freeze preventers become mainstream (which is imminent)?

@deviantintegral

I've also run into this issue booting KVM VMs. It's pretty confusing to have them fail to boot for a reason completely unrelated to KVM.

As a workaround, I manually dropped the ARC size limit and restored it once the VM had allocated its memory.

ORIGINAL_LIMIT=$(cat /sys/module/zfs/parameters/zfs_arc_max)
# Drop to a 1 GB ARC so the freed memory can go to the VM.
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
<boot your VM>
# Restore the original limit once the VM has allocated its memory.
echo $ORIGINAL_LIMIT > /sys/module/zfs/parameters/zfs_arc_max

@neuronmaker

Probably my own fault, but due to limited hardware my backup ZFS server is an old laptop with only 8 GB of total system RAM, which is shared with the integrated GPU. After a few minutes, ZFS is the only thing that can run on it, because other programs do not recognize the ARC's memory as the cache it really is. This causes every program to behave as if my system genuinely has 7.5 of 7.8 GB in use. Quite annoying, and it will likely only get worse if I make a larger pool (I assume). It's not so bad on my main server, since it has more memory and, being headless, does not really run any graphical applications. My backup machine, though, is air-gapped to protect it from power surges while I'm away, and due to its location it has terrible networking; it uses its graphical interface to read incremental backup data off the shuttle drives I use.

If this issue is fixed at some point, it would make my boss reconsider using ZFS on our servers, given how much he likes the snapshot and scrubbing functionality. Until the memory issue can be dealt with (ideally without capping the ARC size), the senior admin will lean heavily on hardware RAID on Windows Server. Not exactly a bad solution, but I think ZFS has some advantages over traditional RAID, and Linux and UNIX both make decent virtualisation hosts.

Ideally I would deploy ZFS for my boot partition, or at least for my home and manually installed application folders, once an acceptable solution to the RAM accounting issue is found. Personally I don't want to limit the ARC size, since I like the idea of using as much of my free RAM as possible as a cache, as long as my applications see it as cache and act normally.

@bghira

bghira commented Dec 8, 2021

If this issue is fixed at some point, it would make my boss reconsider using ZFS on our servers, given how much he likes the snapshot and scrubbing functionality. Until the memory issue can be dealt with (ideally without capping the ARC size), the senior admin will lean heavily on hardware RAID on Windows Server. Not exactly a bad solution, but I think ZFS has some advantages over traditional RAID, and Linux and UNIX both make decent virtualisation hosts.

to be fair, this is not a problem the OpenZFS project has any say over. if they are the type who feel fine running Windows Server on a laptop with 8GB RAM (and somehow capping the ARC is not a solution?) then there's no problems. they seem to be satisfied with Windows Server, and it meets their needs.

@neuronmaker

to be fair, this is not a problem the OpenZFS project has any say over. if they are the type who feel fine running Windows Server on a laptop with 8GB RAM (and somehow capping the ARC is not a solution?) then there's no problems. they seem to be satisfied with Windows Server, and it meets their needs.

It's me using the laptop, due to limited funds... that, and it has only one job: storing my backups. They're fine skipping ZFS and using Windows Server (on real servers), because that is what the entire firm is familiar with. I will try capping the ARC size myself, but personally I like a large cache in RAM, as long as the applications sharing that RAM see it as cache. Other than my edge case, ZFS is a really good system; on my main machine, which has a reasonable amount of memory, ZFS has been the best decision I ever made, since an unstable CPU causes crashes and ZFS has so far been the most fault-tolerant solution I have used to date. They do seem to have some issues, though, since they are considering giving me a Linux VM to try one of their services on, to see whether we could deploy Linux in the future; time will tell if the small Linux VM will be convincing.

@NHellFire

Apologies for the noise, but has anyone had a chance to take another look at this?
It is pretty annoying for monitoring and does cause problems with some apps that check available memory (like qemu). I'm wondering if this also contributed to heavy swapping on one of my old servers that only had 32GB RAM.

@KenMacD

KenMacD commented Mar 14, 2022

If you found this like I did—because Firefox was unloading tabs thinking there wasn't enough memory—then you may find the zfs_arc_sys_free module parameter useful. The man page describes it as:

The target number of bytes the ARC should leave as free memory on the system. Defaults to the larger of 1/64 of physical memory or 512K. Setting this option to a non-zero value will override the default.

I set mine to 3 GB with echo 3221225472 | sudo tee /sys/module/zfs/parameters/zfs_arc_sys_free.

@ast0815

ast0815 commented Mar 15, 2022

Thanks for that one, @KenMacD. Using that parameter seems a bit more flexible than zfs_arc_max.

As a side note, for people who want to make these changes permanent, follow the instructions here: https://www.cyberciti.biz/faq/how-to-set-up-zfs-arc-size-on-ubuntu-debian-linux/

Just don't forget to update the initramfs, like I did. 🙄
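For reference, a minimal sketch of making such a change persistent on a Debian/Ubuntu-style system, using the zfs_arc_sys_free value from the example above (the file name just follows the usual modprobe.d convention; adjust the value to taste):

# /etc/modprobe.d/zfs.conf -- module options applied at boot
options zfs zfs_arc_sys_free=3221225472   # keep ~3 GB free for the rest of the system

# regenerate the initramfs so the option is picked up early, then reboot:
#   sudo update-initramfs -u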

@digitalsignalperson

A solution here would be great.

Props to htop for reporting sensible free memory (it counts shrinkable ARC as cache: htop-dev/htop@491c6f1).

My latest run-in is compiling UnrealEngine and getting nerfed parallel actions

Determining max actions to execute in parallel (16 physical cores, 16 logical cores)
  Executing up to 16 processes, one per physical core
  Requested 1.5 GB free memory per action, 10.03 GB available: limiting max parallel actions to 6

where in reality >40GB is available.

@zyansheep

I'm running NixOS + ZFS w/ Framework 13 AMD Laptop, been dealing with random crashes for 2 weeks! (didn't know what was causing them). Just now saw in journalctl that it was earlyoom. (I even submitted a bug report to vscode lol).

Is anyone currently actively working on this issue? If so, is there anything I can do to help?

If not, I would love to delve in and try to fix this bug. Does anyone have any tips as to what docs / code I should read first?

@digitalsignalperson

Is a solution even technically possible, or is it in the hands of the kernel?

@robn
Member

robn commented Jan 14, 2024

Unfortunately it's not about where things are accounted to, as such. We don't get to choose that. Linux accounts a certain type of memory through MemAvailable (roughly, "page cache"). That's not the type of memory the ARC uses, for both historical and practical reasons.

It's unclear whether it's possible to convert the ARC to use that kind of memory. I think at least a prototype is possible, but it's an enormous job, and I don't know if all the niches are possible. I wouldn't do it just to change where the memory is accounted to, though; the main reason for me would be to reduce copies on mmap-heavy workloads. It's something I'd like to try, but I'm extremely unlikely to get to it any time soon.

I gotta say though, I am slightly inclined to point the finger at earlyoom here, because it's not implementing the same logic as the true OOM killer. OpenZFS registers cleanup callbacks with the kernel; when it's under memory pressure the kernel will ask OpenZFS to make memory available, which it will do. If earlyoom is basing its kill decisions solely on kernel memory stats, then it could be argued that it's not precise enough.

It seems like it might be straightforward to patch it to include ARC size in its accounting. Has anyone asked about this? I didn't see an issue on the tracker.

@daniellandau

As mentioned earlier in this issue, it's not just earlyoom, or even just all the early-OOM daemons, that you'd need to patch for ARC-specific handling.

@rowlap

rowlap commented Jan 14, 2024

There was earlyoom PR 191 for this, but it was closed due to lack of testing.

oomd makes use of PSI (pressure stall information).

We make use of earlyoom with custom ZFS-awareness patches, but would prefer everything to be happy upstream. Sure, projects can point to other projects doing the Wrong Thing, but the definition of MemAvailable is

An estimate of how much memory is available for starting new applications, without swapping.

When the ARC is (a) responsive to memory pressure and (b) not accounted for in MemAvailable, it's hard for userspace to determine system health. "Should I trigger an OOM kill?", "Is there enough memory to launch X?", and "Should Firefox unload this tab?" all tend to rely on the MemAvailable abstraction, and "fixing" one app doesn't help the others.

Would it be feasible to hook into the MemAvailable accounting (suggested by @behlendorf, April 2020)?

@robn
Member

robn commented Jan 14, 2024

I don't disagree that having it all in one place would be easier for tools that don't know the difference. What I'm saying is that it's hard to do, maybe impossible. Wanting it more doesn't change that.

@devZer0

devZer0 commented Jan 16, 2024

duplicate:
#15775
#10251

@bignay2000

bignay2000 commented Jun 6, 2024

Any update on when this can be addressed?

I think this is causing my BeeLink GTR7PRO (96 GB RAM, Proxmox 8.2.2 with ZFS) to crash about once a week when doing a VM backup.

zfs version
zfs-2.2.3-pve2
zfs-kmod-2.2.3-pve2

Logs right before the Proxmox kernel panicked:

root@gtr7pro:~# journalctl --since "2024-06-04 15:55" --until "2024-06-04 16:11"
Jun 04 15:58:54 gtr7pro pmxcfs[1295]: [status] notice: received log
Jun 04 15:58:54 gtr7pro pvedaemon[1438]: <hiveadmin@pam> starting task UPID:gtr7pro:00004D14:000AE00F:665F71FE:vzdump::hiveadmin@pam:
Jun 04 15:58:54 gtr7pro pvedaemon[19732]: INFO: starting new backup job: vzdump --mailnotification failure --mailto systems@example.com --compress zstd --prune-backups 'keep-last=3' --notes-template '{{guestname}}' --storage local --mode snapshot --all 1 --fleecing 0 --node gtr7pro
Jun 04 15:58:54 gtr7pro pvedaemon[19732]: INFO: Starting Backup of VM 102 (qemu)
Jun 04 15:58:57 gtr7pro pvedaemon[19732]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
Jun 04 15:59:00 gtr7pro kernel: hrtimer: interrupt took 5490 ns
Jun 04 15:59:08 gtr7pro kernel: BUG: unable to handle page fault for address: 0000040000000430
Jun 04 15:59:08 gtr7pro kernel: #PF: supervisor read access in kernel mode
Jun 04 15:59:08 gtr7pro kernel: #PF: error_code(0x0000) - not-present page

zpool status -v returned errors around the virtual disk file for this VM 102. I was able to reboot and repair the pool and then start the VM.

@bignay2000

@sjau Did you ever resolve your issue(s) with qemu/kvm requested ram?

@bignay2000

This is something we can take another look at. When this code was originally integrated with the kernel there were a variety of technical issues which prevented us from reporting the ARC space as page cache pages. Since then the Linux kernel has changed considerably, as has ZFS, so this may now be more feasible and is definitely worth re-investigating.

Any idea whether there have been more discussions on low-level ZFS/Linux integration? @behlendorf

@sjau

sjau commented Jun 6, 2024

@sjau Did you ever resolve your issue(s) with qemu/kvm requested ram?

My solution was getting more ram ;)

@deviantintegral

I recently had a server running Proxmox kill a VM due to an OOM. It was very quick though - none of my containers or VMs showed a memory increase, and my graphs only show a sharp increase in the ARC usage right before the OOM at 4:10 AM:

[screenshot: graph of ARC usage spiking right before the 4:10 AM OOM]

We can also see that the ARC stayed larger after the OOM than before it, which indicates it wasn't being constrained or freed properly.

This did occur during a regular scrub, but the scrub was started hours before. This is all with ZFS 2.2.3-pve2.

Looking into this, I found a note at https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage which indicated that Proxmox is now more aggressively limiting ARC memory by default on new installs, to no more than 10% of installed RAM or 16GB. I've done the same, but honestly this feels like a failure in the ARC that's just being masked.

@bignay2000

bignay2000 commented Jun 13, 2024

I created an /etc/modprobe.d/zfs.conf file and went with an 8 GB ZFS ARC maximum.

options zfs zfs_arc_max=8589934592

Here are the ARC stats before this change; I definitely do not need 48 GB of RAM (c_max) going to a ZFS RAID 1 mirror (2 x 2 TB Samsung 990 Pro NVMe Gen4 drives).

cat /proc/spl/kstat/zfs/arcstats 
9 1 0x01 147 39984 1405644657 619562335795635
name                            type data
hits                            4    138399132
iohits                          4    1032529
misses                          4    78349531
demand_data_hits                4    12679760
demand_data_iohits              4    938681
demand_data_misses              4    73437903
demand_metadata_hits            4    125444339
demand_metadata_iohits          4    381
demand_metadata_misses          4    2349
prefetch_data_hits              4    196397
prefetch_data_iohits            4    91915
prefetch_data_misses            4    4894859
prefetch_metadata_hits          4    78636
prefetch_metadata_iohits        4    1552
prefetch_metadata_misses        4    14420
mru_hits                        4    13530029
mru_ghost_hits                  4    136
mfu_hits                        4    124869103
mfu_ghost_hits                  4    69
uncached_hits                   4    0
deleted                         4    77569986
mutex_miss                      4    29004
access_skip                     4    93
evict_skip                      4    2
evict_not_enough                4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    1728974577152
evict_l2_eligible_mfu           4    45784064
evict_l2_eligible_mru           4    1728928793088
evict_l2_ineligible             4    843452416
evict_l2_skip                   4    0
hash_elements                   4    3496614
hash_elements_max               4    3772608
hash_collisions                 4    12684235
hash_chains                     4    317389
hash_chain_max                  4    6
meta                            4    1073687296
pd                              4    2147106211
pm                              4    2147483648
c                               4    48361080832
c_min                           4    3022567552
c_max                           4    48361080832
size                            4    48284312984
compressed_size                 4    45911926272
uncompressed_size               4    55513380352
overhead_size                   4    1464989696
hdr_size                        4    841842752
data_size                       4    46921547776
metadata_size                   4    455368192
dbuf_size                       4    37936584
dnode_size                      4    22260048
bonus_size                      4    4468800
anon_size                       4    342528
anon_data                       4    211456
anon_metadata                   4    131072
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    40709551616
mru_data                        4    40592649216
mru_metadata                    4    116902400
mru_evictable_data              4    38850707456
mru_evictable_metadata          4    65455616
mru_ghost_size                  4    3383918592
mru_ghost_data                  4    3383918592
mru_ghost_metadata              4    0
mru_ghost_evictable_data        4    3383918592
mru_ghost_evictable_metadata    4    0
mfu_size                        4    6667021824
mfu_data                        4    6328687104
mfu_metadata                    4    338334720
mfu_evictable_data              4    6002567168
mfu_evictable_metadata          4    231745536
mfu_ghost_size                  4    38369280
mfu_ghost_data                  4    38369280
mfu_ghost_metadata              4    0
mfu_ghost_evictable_data        4    38369280
mfu_ghost_evictable_metadata    4    0
uncached_size                   4    0
uncached_data                   4    0
uncached_metadata               4    0
uncached_evictable_data         4    0
uncached_evictable_metadata     4    0
l2_hits                         4    0
l2_misses                       4    0
l2_prefetch_asize               4    0
l2_mru_asize                    4    0
l2_mfu_asize                    4    0
l2_bufc_data_asize              4    0
l2_bufc_metadata_asize          4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_log_blk_writes               4    0
l2_log_blk_avg_asize            4    0
l2_log_blk_asize                4    0
l2_log_blk_count                4    0
l2_data_to_meta_ratio           4    0
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    0
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    103
memory_all_bytes                4    96722161664
memory_free_bytes               4    32631607296
memory_available_bytes          3    29310502784
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1361876376
arc_dnode_limit                 4    4836108083
async_upgrade_sync              4    938691
predictive_prefetch             4    5238512
demand_hit_predictive_prefetch  4    4167915
demand_iohit_predictive_prefetch 4    938768
prescient_prefetch              4    39267
demand_hit_prescient_prefetch   4    34373
demand_iohit_prescient_prefetch 4    162
arc_need_free                   4    0
arc_sys_free                    4    3321104512
arc_raw_size                    4    0
cached_only_in_progress         4    0
abd_chunk_waste_size            4    888832

After update-initramfs -u -k all and a reboot with the new config file, c_max now shows 8 GB:

cat /proc/spl/kstat/zfs/arcstats 
9 1 0x01 147 39984 1435582605 1120417395413
name                            type data
hits                            4    402710
iohits                          4    2081
misses                          4    73649
demand_data_hits                4    103782
demand_data_iohits              4    121
demand_data_misses              4    54500
demand_metadata_hits            4    294899
demand_metadata_iohits          4    359
demand_metadata_misses          4    2368
prefetch_data_hits              4    763
prefetch_data_iohits            4    1
prefetch_data_misses            4    16014
prefetch_metadata_hits          4    3266
prefetch_metadata_iohits        4    1600
prefetch_metadata_misses        4    767
mru_hits                        4    210825
mru_ghost_hits                  4    0
mfu_hits                        4    191885
mfu_ghost_hits                  4    0
uncached_hits                   4    0
deleted                         4    19
mutex_miss                      4    0
access_skip                     4    0
evict_skip                      4    2
evict_not_enough                4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    291840
evict_l2_eligible_mfu           4    0
evict_l2_eligible_mru           4    291840
evict_l2_ineligible             4    4096
evict_l2_skip                   4    0
hash_elements                   4    103082
hash_elements_max               4    103082
hash_collisions                 4    386
hash_chains                     4    301
hash_chain_max                  4    1
meta                            4    1073741824
pd                              4    2147483648
pm                              4    2147483648
c                               4    3022567424
c_min                           4    3022567424
c_max                           4    8589934592
size                            4    1562462544
compressed_size                 4    1395636736
uncompressed_size               4    2072109056
overhead_size                   4    127916032
hdr_size                        4    24865552
data_size                       4    1440855040
metadata_size                   4    82697728
dbuf_size                       4    4138344
dnode_size                      4    6983448
bonus_size                      4    2164160
anon_size                       4    135168
anon_data                       4    0
anon_metadata                   4    135168
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    1412013568
mru_data                        4    1358900736
mru_metadata                    4    53112832
mru_evictable_data              4    1266563072
mru_evictable_metadata          4    13077504
mru_ghost_size                  4    0
mru_ghost_data                  4    0
mru_ghost_metadata              4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    111404032
mfu_data                        4    81954304
mfu_metadata                    4    29449728
mfu_evictable_data              4    58868224
mfu_evictable_metadata          4    9779712
mfu_ghost_size                  4    0
mfu_ghost_data                  4    0
mfu_ghost_metadata              4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
uncached_size                   4    0
uncached_data                   4    0
uncached_metadata               4    0
uncached_evictable_data         4    0
uncached_evictable_metadata     4    0
l2_hits                         4    0
l2_misses                       4    0
l2_prefetch_asize               4    0
l2_mru_asize                    4    0
l2_mfu_asize                    4    0
l2_bufc_data_asize              4    0
l2_bufc_metadata_asize          4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_log_blk_writes               4    0
l2_log_blk_avg_asize            4    0
l2_log_blk_asize                4    0
l2_log_blk_count                4    0
l2_data_to_meta_ratio           4    0
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    0
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
memory_all_bytes                4    96722157568
memory_free_bytes               4    88757563392
memory_available_bytes          3    85436459008
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    120849232
arc_dnode_limit                 4    858993459
async_upgrade_sync              4    125
predictive_prefetch             4    22388
demand_hit_predictive_prefetch  4    14120
demand_iohit_predictive_prefetch 4    193
prescient_prefetch              4    23
demand_hit_prescient_prefetch   4    16
demand_iohit_prescient_prefetch 4    7
arc_need_free                   4    0
arc_sys_free                    4    3321104384
arc_raw_size                    4    0
cached_only_in_progress         4    0
abd_chunk_waste_size            4    758272 

@bignay2000

I'm thinking a proper workaround would be to configure the zfs_arc_max value and set a corresponding Linux boot parameter that limits the kernel's maximum RAM to physical memory minus this zfs_arc_max amount.

I'm researching how to set the maximum memory for Proxmox on Debian 12.

@behlendorf - Is this idea feasible? Would limiting the Linux RAM work, or would there still be a memory-type accounting issue?

@bignay2000

bignay2000 commented Jun 13, 2024

@ThomasLamprecht How does Proxmox handle this scenario?

My server started as Proxmox 8.0 and was upgraded via apt dist-upgrade to the latest 8.2.2; however, it never got the /etc/modprobe.d/zfs.conf file from your Bugzilla report 4829, which was fixed in the 8.1 installer.

I think my server was crashing when VMs were consuming memory (I have an AI VM running a large language model, so it goes from using 1.5 GB of RAM to over 64 GB) while a VM backup occurred at the same time; Linux then crashed with a memory fault because it was not aware of this ZFS ARC memory?

I wonder whether there is also a use case for Proxmox to change the default zfs_arc_sys_free configuration.

@ThomasLamprecht
Contributor

ThomasLamprecht commented Jun 17, 2024

@bignay2000

My server started as Proxmox 8.0 and was upgraded via apt dist-upgrade to the latest 8.2.2; however, it never got the /etc/modprobe.d/zfs.conf file from your Bugzilla report 4829, which was fixed in the 8.1 installer.

Yeah, we strictly limit setting up our dynamic (lower) default ARC size to new installations, as we do not want to mess with existing systems; we cannot really tell whether the admin explicitly wants ZFS to use 50% or not.

I think my server was crashing when VMs were consuming memory (I have an AI VM running a large language model, so it goes from using 1.5 GB of RAM to over 64 GB) while a VM backup occurred at the same time; Linux then crashed with a memory fault because it was not aware of this ZFS ARC memory?

Sudden memory usage spikes can indeed result in OOM if memory cannot be reclaimed fast enough, although I'm not sure about the specifics of the kernel/ZFS interaction here off the top of my head.

I wonder whether there is also a use case for Proxmox to change the default zfs_arc_sys_free configuration.

Not so sure about that, as the requirements for that value are very dependent on the actual use case. E.g., if VMs, or other huge-memory workloads like your AI tool, are frequently started and stopped again, it could be good to reserve headroom for that fluctuation by using zfs_arc_sys_free; but for more stable systems, where memory usage stays mostly constant, doing so might even make overall system performance worse (free and unused memory is useless memory, after all).

So, at least for Proxmox VE, this is IMO something that should be handled by the administrators for their specific use case.
What I could imagine is mentioning that knob in our documentation; if you think that'd be useful, please open an enhancement request over at our bug and feature tracker: https://bugzilla.proxmox.com/

@bignay2000

bignay2000 commented Jun 19, 2024

Update - After I configured /etc/modprobe.d/zfs.conf, my AMD-based GTR7Pro Proxmox hypervisor has not crashed or hung. VMs and their applications have been running without issue. However, most backups of my main VM fail with an error. If I move this main VM to completely different hardware (an older Intel host), it backs up successfully and runs without issue.

Today I updated from Proxmox 8.2.2 to 8.2.4. In initial testing, I was able to back up the main VM without an error after this update. I'll report back in a week to see whether my issue is truly resolved.

Upgrade:

root@gtr7pro:~# uname -a
Linux gtr7pro 6.8.8-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-1 (2024-06-10T11:42Z) x86_64 GNU/Linux
root@gtr7pro:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.8-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.8-1
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.4-1
proxmox-backup-file-restore: 3.2.4-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.12-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1

@bignay2000

bignay2000 commented Jun 25, 2024

Update: I have not had any ZFS-related errors since updating to Proxmox v8.2.4.

@ThomasLamprecht Great work on Proxmox v8.2.4; really appreciate your team.

The system log has been quiet for the last 4 days:

root@gtr7pro:~# journalctl --since "2024-06-20 00:00" | grep -i error
Jun 20 07:00:59 gtr7pro kernel: RPC: Could not send backchannel reply error: -110
Jun 21 07:00:55 gtr7pro kernel: RPC: Could not send backchannel reply error: -110
Jun 22 07:01:16 gtr7pro kernel: RPC: Could not send backchannel reply error: -110
Jun 22 07:04:47 gtr7pro kernel: showmount[970800]: segfault at 73de14df06f0 ip 000073de14df06f0 sp 00007ffdc7656878 error 14 likely on CPU 8 (core 0, socket 0)
Jun 22 07:06:07 gtr7pro kernel: RPC: Could not send backchannel reply error: -110
Jun 24 07:06:56 gtr7pro kernel: RPC: Could not send backchannel reply error: -110
Jun 24 17:19:44 gtr7pro pvestatd[1362]: metrics send error 'proxmox': 500 Can't connect to influxdb.example.net:8086 (Temporary failure in name resolution)
Jun 24 17:19:44 gtr7pro pmxcfs[1234]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/103: -1
Jun 24 17:19:44 gtr7pro pmxcfs[1234]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/100: -1
Jun 24 17:19:44 gtr7pro pmxcfs[1234]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/104: -1
Jun 24 17:19:44 gtr7pro pmxcfs[1234]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-vm/106: -1
Jun 24 17:19:44 gtr7pro pmxcfs[1234]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/gtr7pro/nas.example.net: -1
Jun 24 17:19:44 gtr7pro pmxcfs[1234]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/gtr7pro/local-zfs: -1
Jun 24 17:19:44 gtr7pro pmxcfs[1234]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/gtr7pro/local: -1

@satmandu
Contributor

satmandu commented Oct 29, 2024

I've noticed the ARC getting really large when I create a lot of Docker containers, which eventually causes OOM issues on Ubuntu 24.10.

My workaround is to have this in my /etc/rc.local:

# Set zfs_arc_max to half currently free memory or 8 Gb, whichever is lower.
half_free_mem_now="$(free -bL | awk '{print $8/2}')"
echo "Half free mem: $((half_free_mem_now / 1073741824)) Gb"
[[ $half_free_mem_now -lt 8589934592 ]] && echo "${half_free_mem_now}" > /sys/module/zfs/parameters/zfs_arc_max || echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
echo "zfs_arc_max => $(($(cat /sys/module/zfs/parameters/zfs_arc_max) / 1073741824)) Gb"
zfs_arc_size="$(awk '/^size/ { print $1 " " $3 / 1073741824 }' /proc/spl/kstat/zfs/arcstats)"
echo "ZFS ARC ${zfs_arc_size} Gb"

If I find I need to dramatically reduce the ARC, I also run this:

sudo sync && echo 2 | sudo tee /proc/sys/vm/drop_caches

I've also been using arc_summary and sudo slabtop -o for additional debugging.

I added this to my ~/.bashrc:

zfs_arc_size="$(awk '/^size/ { print $1 " " $3 / 1073741824 }' /proc/spl/kstat/zfs/arcstats)"
echo "ZFS ARC ${zfs_arc_size} Gb"
smem -t -k -w

This gives me a good sense of how things are:

ZFS ARC size 3.78197 Gb
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory          6.0G     869.9M       5.2G
userspace memory               6.3G       2.0G       4.4G
free memory                   33.3G      33.3G          0
----------------------------------------------------------
                              45.6G      36.1G       9.5G

The ARC is in Noncache kernel dynamic memory. These steps keep everything working smoothly for me, at the expense of not using unused RAM for ARC when it could be helpful.

@amotin
Member

amotin commented Oct 31, 2024

@satmandu You haven't specified the ZFS version you are using, but in 2.3 I've made a few changes to make ZFS more responsive to kernel memory requests. It should also report reclaimable memory to the kernel, which might be visible to some external consumers, like memory balloon drivers, etc.

@satmandu
Contributor

satmandu commented Oct 31, 2024

@satmandu You haven't specified the ZFS version you are using, but in 2.3 I've made a few changes to make ZFS more responsive to kernel memory requests.

Apologies! I am using OpenZFS 2.3.0-rc2, from which I am generating a PPA here: https://launchpad.net/~satadru-umich/+archive/ubuntu/zfs-experimental

The issues cropped up after updating to Ubuntu 24.10. I wasn't having issues with the 2.3.0 RC series on Ubuntu 24.04, I think.

(And this is on either 6.11.x or 6.12.0-rc kernels.)

@amotin
Member

amotin commented Oct 31, 2024

@satmandu Try setting zfs_arc_shrinker_limit=0, as we do in TrueNAS. I don't like this parameter, even though it was made less restrictive in 2.3.
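For anyone who wants to try this, a minimal sketch of applying it both at runtime and persistently (the persistent variant mirrors the /etc/modprobe.d/zfs.conf approach used earlier in this thread):

# takes effect immediately:
echo 0 | sudo tee /sys/module/zfs/parameters/zfs_arc_shrinker_limit

# persist across reboots:
echo "options zfs zfs_arc_shrinker_limit=0" | sudo tee -a /etc/modprobe.d/zfs.conf
sudo update-initramfs -u   # then reboot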

@satmandu
Contributor

satmandu commented Nov 1, 2024

@amotin Setting zfs_arc_shrinker_limit=0 appears to entirely handle my issues, and I'm no longer getting OOM errors despite a larger ARC size.

Thank you for the suggestion:

ZFS ARC size 33.2544 Gb
Area                           Used      Cache   Noncache 
firmware/hardware                 0          0          0 
kernel image                      0          0          0 
kernel dynamic memory         36.1G       3.6G      32.5G 
userspace memory               5.3G     582.8M       4.7G 
free memory                    4.2G       4.2G          0 
----------------------------------------------------------
                              45.6G       8.4G      37.2G 

@amotin
Member

amotin commented Nov 1, 2024

@shodanshok ^^^

@shodanshok
Contributor

Apologies! I am using OpenZFS 2.3.0-rc2, from which I am generating a PPA here: https://launchpad.net/~satadru-umich/+archive/ubuntu/zfs-experimental

Can I ask whether these are "true" kernel OOMs (with dmesg showing related info) or whether something like earlyoom was used?

#16313 should make zfs_arc_shrinker_limit be ignored when the kernel asks for direct reclaim, giving the effective behavior of zfs_arc_shrinker_limit=0.

@satmandu
Contributor

satmandu commented Nov 2, 2024

Apologies! I am using OpenZFS 2.3.0-rc2, from which I am generating a PPA here: https://launchpad.net/~satadru-umich/+archive/ubuntu/zfs-experimental

Can I ask whether these are "true" kernel OOMs (with dmesg showing related info) or whether something like earlyoom was used?

#16313 should make zfs_arc_shrinker_limit be ignored when the kernel asks for direct reclaim, giving the effective behavior of zfs_arc_shrinker_limit=0.

I believe that earlyoom had already been uninstalled at that point in my debugging.

sudo apt remove earlyoom
Package 'earlyoom' is not installed, so not removed
Summary:
  Upgrading: 0, Installing: 0, Removing: 0, Not Upgrading: 1

Sample from dmesg when I was having these issues.

oom.dmesg.txt

Happy to unset zfs_arc_shrinker_limit and retry the steps that were causing these OOM messages, though...

@satmandu
Contributor

satmandu commented Nov 3, 2024

@shodanshok without zfs_arc_shrinker_limit=0 set I am easily getting OOMs...

From the current dmesg:

dmesg.newoom.txt

@shodanshok
Contributor

@satmandu thanks for your tests. From the log, I can see all the OOMs are due to kswapd:

# grep invoked dmesg.newoom.txt
[ 2632.101013] kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 2632.147981] kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 2632.582976] kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 2632.717577] kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 2632.720095] kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 2632.800258] kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 2632.804544] kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 2632.824772] kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 2633.105465] kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0

This is somewhat surprising, as the legacy kernel behavior was to attempt direct reclaim when kswapd reclaimed too little memory. However, it seems that newer MGLRU-enabled kernels have a new behavior meant to avoid thrashing the current working set when kswapd reclaims memory. From the kernel docs:

Users can write N to min_ttl_ms to prevent the working set of N milliseconds from getting evicted. The OOM killer is triggered if this working set cannot be kept in memory

As I understand it, this means that if kswapd cannot free enough memory fast enough to avoid touching the current working set, it will invoke the OOM killer. Since zfs_arc_shrinker_limit throttles kswapd reclaims, OOM is invoked even with #16313 applied.

If you are willing to do some more tests:

  • does your kernel use MGLRU (cat /sys/kernel/mm/lru_gen/enabled returns non-zero)?
  • if so, does disabling it (leaving zfs_arc_shrinker_limit at default) change anything?
  • does leaving MGLRU enabled (and zfs_arc_shrinker_limit at default) but with min_ttl_ms=0 avoid the issue?

Thanks.
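For reference, a minimal sketch of those checks, using the paths mentioned above (run as root; the second and third steps are alternatives to try independently):

# 1. is MGLRU in use? (non-zero output means enabled)
cat /sys/kernel/mm/lru_gen/enabled

# 2. disable MGLRU entirely, leaving zfs_arc_shrinker_limit at its default:
echo 0 > /sys/kernel/mm/lru_gen/enabled

# 3. or keep MGLRU enabled but disable its working-set protection:
echo 0 > /sys/kernel/mm/lru_gen/min_ttl_ms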

@satmandu
Contributor

satmandu commented Nov 3, 2024

Looks like I do have it enabled:

cat /sys/kernel/mm/lru_gen/enabled
0x0007

I'll see what happens when I disable it...

@satmandu
Contributor

satmandu commented Nov 3, 2024

Both disabling MGLRU with echo 0 > /sys/kernel/mm/lru_gen/enabled and (independently) doing echo 0 > /sys/kernel/mm/lru_gen/min_ttl_ms resolve the OOM issues for me.

@Soukyuu

Soukyuu commented Nov 3, 2024

This is the first ZFS pool I've ever created, and I've hit the problem consistently while trying to move my data from root to a dataset. I'm still getting used to how ZFS works, but that was very unsettling, because the OOM killer killed my cp process running via ssh.

After setting min_ttl_ms to 0, things work fine again. This is on an 8 GB RAM system without any swap.

@Soukyuu

Soukyuu commented Nov 3, 2024

Scratch that. Just got OOM-killed again. Attempting with MGLRU disabled.

@Soukyuu

Soukyuu commented Nov 3, 2024

And no dice. Got OOM-killed even faster. I kind of have a feeling it's not respecting my ARC size...

edit: I'm on 2.2.6 btw, so maybe I'll have to wait for 2.3 for things to get better? I'm actually surprised by how badly it behaves out of the box for me, since ZFS has a reputation for being very stable (unlike, say, btrfs, which is where I'm coming from).

@satmandu
Contributor

satmandu commented Nov 8, 2024

I eventually ended up with a lockup when I had echo 0 > /sys/kernel/mm/lru_gen/min_ttl_ms set.

I switched to echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit and then I got an OOM overnight.

dmesg:
arc_shrinker_limit_zero.dmesg.txt
