Ensure pulls are reliable with lower resources (memory, network speed) #2732
Interesting specs. What values did you try, for example?
The three values I've been changing are the limits in ostree-repo-private.h. The write requests limit was changed to 3 in a release soon after 2020.3. I've gone through everything up to and including 2022.2; I think it was 2022.3 that moved to a newer GLib version, and I'm running a very old uclibc build, so updating GLib is going to be a long process. I'll try --disable-static-deltas as well. I have lots of logs, including the /var/log/messages output from the oom reaper. Thanks!
I actually ran into problems with the FETCHER_REQUESTS 3 setting. 4 or 5 seems to work better; 3 will sometimes cause errors and early termination.
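For context, the limits being discussed are compile-time constants in src/libostree/ostree-repo-private.h. A minimal sketch of the kind of definitions being tuned is below; the exact names and default values differ between ostree releases, so treat these as illustrative rather than authoritative:

```c
/* Illustrative only: constants of roughly this shape live in
 * src/libostree/ostree-repo-private.h, but names and default values
 * vary between releases. */
#define _OSTREE_MAX_OUTSTANDING_FETCHER_REQUESTS   8  /* concurrent HTTP fetches */
#define _OSTREE_MAX_OUTSTANDING_WRITE_REQUESTS     3  /* concurrent object writes */
#define _OSTREE_MAX_OUTSTANDING_DELTAPART_REQUESTS 2  /* concurrent delta parts */
```

Lowering these reduces peak concurrency (and therefore peak memory) at the cost of pull throughput.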
Instrumenting the code has given me some insight into the problem. It's not actually a problem endemic to ostree, but in this case ostree definitely exacerbates it.
I've finished my investigation and concluded that there are a number of conditions required to manifest this particular problem:
The oom-killer death is not directly related to anything ostree is doing, but because fetch requests pile up quickly in the queue while larger fetches are in progress, we get enough heap fragmentation that uncommitted pages are pulled in, and the I/O activity exhausts the remaining available memory very quickly. The attached patch applies to 2020.3 and works around the problem in such extreme cases by implementing three options. It's really a horrendous kluge, but it allows me to run to completion without playing oom-killer roulette. I doubt anyone else has run into this sort of extreme case, and there's probably a more elegant way to handle blocking on large requests. I'm limited to 2020.3 due to the signing methods available, but I'm open to suggestions on the best way to make this available. In any case, here is the patch.
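For readers following along, a minimal sketch of the blocking approach described above, assuming a hypothetical in-flight counter (this is not actual libostree API; names are placeholders):

```c
#include <glib.h>

/* Hypothetical sketch of the workaround described above: before
 * enqueuing another fetch, iterate the main context until the number
 * of in-flight requests drops below a cap.  This bounds the queue at
 * the cost of blocking the calling thread. */
static void
wait_for_fetch_slot (GMainContext *context,
                     const guint  *n_outstanding,   /* hypothetical counter */
                     guint         max_outstanding)
{
  while (*n_outstanding >= max_outstanding)
    g_main_context_iteration (context, TRUE);  /* block until something completes */
}
```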
Yeah, we clearly need backpressure here. Your patch is AFAICS implementing this by blocking the main thread, which probably works well enough for unattended device updates, but anything linking libostree and doing other work on that thread may be surprised by the blocking (specifically, e.g. on a tty we'll stop rendering fetch progress). Hmm. This issue is really an extension of the problems with enqueuing too many HTTP requests that caused us to add queuing for the HTTP fetching in c18628e. Do you have a handy reproduction scenario for this? Does
help?
We added backoff/queueing for fetching via HTTP, but we have another queue in the metadata scanning which can also grow up to the number of outstanding objects, which can be large. Capping the scanning operation when we have hit our operation limit will avoid potentially large amounts of allocations in the case of e.g. a slow network. Closes: ostreedev#2732
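Conceptually, the cap described in that change can be pictured as something like the following sketch; the structure and field names here are hypothetical and are not the actual diff:

```c
#include <glib.h>

/* Sketch only.  Instead of scanning every outstanding object
 * immediately (which lets the scan queue grow to O(files) entries on
 * a slow network), defer objects once the operation limit is hit and
 * drain the backlog as fetches complete. */
typedef struct {
  GQueue *deferred_scans;      /* objects waiting to be scanned */
  guint   n_outstanding_ops;   /* fetches/scans currently in flight */
  guint   max_outstanding_ops; /* cap, e.g. the fetcher request limit */
} ScanThrottle;

static void
maybe_enqueue_scan (ScanThrottle *t, gpointer object)
{
  if (t->n_outstanding_ops >= t->max_outstanding_ops)
    g_queue_push_tail (t->deferred_scans, object);  /* defer; don't allocate more work */
  else
    t->n_outstanding_ops++;                         /* ...and start the scan here */
}
```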
OK, filed as #2766
Thanks for the feedback. I wanted to follow up and say that I'm going to have a look at the suggested changes.
There is an article online that suggests ostree memory usage can be quite high: https://witekio.com/de/blog-de/ostree-tutorial-system-updates/. Do we have any hard data, e.g. in a blog post, describing what kind of memory is typically required for upgrades on an el9 ostree-based system? If not, should we write one?
We don't have any hard data. In #2766 I found what I believe is one source of memory use proportional to O(files). I think we can repurpose this issue for gathering data and metrics, and ideally scaling our queues roughly to available resources (CPU, network speed, I/O bandwidth, memory). We definitely don't want any unbounded queues. |
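As a starting point for "scaling queues to available resources", one could imagine a heuristic like the following; this is purely illustrative (the thresholds are made up) and not something libostree does today:

```c
#include <unistd.h>

/* Hypothetical heuristic: derive a fetch-queue depth from currently
 * available physical memory.  Threshold values are arbitrary. */
static unsigned int
suggested_max_outstanding_fetches (void)
{
  long pages     = sysconf (_SC_AVPHYS_PAGES);
  long page_size = sysconf (_SC_PAGESIZE);
  long avail_mb  = (pages * (page_size / 1024)) / 1024;

  if (avail_mb < 128)
    return 2;   /* heavily constrained devices, like the 128 MB board above */
  else if (avail_mb < 512)
    return 4;
  else
    return 8;
}
```

A real implementation would presumably also factor in network speed and I/O bandwidth, as suggested above.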
First unusual event: ostree pull (using libcurl, https) is reaped by the oom-killer on a very large delta pull.
Kernel version: 4.9.217 (MIPS)
Total available physical memory: 85 MB (physical DRAM is 128 MB, minus a large CMA chunk, minus the kernel and wpa_supplicant).
For the most part, I am able to successfully run ostree pull. However, if the total quantity of deltas is high, ostree is likely to use up all available memory and get reaped by the oom-killer.
This happens less frequently if the thread limits in ostree-repo-private.h are set to lower values, but it is still guaranteed to happen on a large set of deltas. I am currently investigating running under ulimit as a workaround, but I have to ask whether there might be a better way to avoid this.
In the usual case I get exit status 137 in a script and can retry the ostree pull repeatedly until it succeeds. In the worst case I get an oom-killer murder spree that starts terminating every process on the system, and while it is looking for victims an I/O thread fails, corrupting ubifs and leaving /ostree in a read-only state. In short, I need to avoid oom-killer deaths. I have very little memory on this system and enough NAND to pull the update (but not enough for a second rootfs).
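One way to approximate the ulimit workaround mentioned above from a small wrapper program is an address-space cap via setrlimit, so a runaway pull fails with an allocation error instead of triggering the oom-killer. This is an illustrative sketch: the 64 MiB limit is an arbitrary example, and the remote/ref names are placeholders.

```c
#include <sys/resource.h>
#include <unistd.h>

/* Illustrative wrapper: cap the virtual address space of this process
 * (inherited by the exec'd `ostree pull`) so allocation fails cleanly
 * rather than the kernel oom-killer picking victims. */
int
main (void)
{
  struct rlimit lim = { .rlim_cur = 64UL * 1024 * 1024,
                        .rlim_max = 64UL * 1024 * 1024 };
  if (setrlimit (RLIMIT_AS, &lim) != 0)
    return 1;

  /* Remote and ref are placeholders for illustration. */
  execlp ("ostree", "ostree", "pull", "origin:exampleref", (char *) NULL);
  return 1;  /* only reached if exec fails */
}
```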