Pre-checking: reject PVFs where preparation surpassed threshold memory usage #745

mrcnski opened this issue Dec 22, 2022 · 4 comments

mrcnski commented Dec 22, 2022

ISSUE

Overview

This is an attempt at mitigating disputes caused by OOM errors.

As @eskimor described here:

We do retry now on OOM errors, but that does not really resolve the issue. In any case, this should be resolved via changes to pre-checking. E.g. by not accepting a PVF where the preparation consumed more than X amount of memory. Where X is low enough that it should work on any validator which adheres to spec recommendations.

I have two open questions:

Strategy

We are already using tikv-jemalloc for all processes. We should be able to use tikv-jemalloc-ctl to get memory stats in the child worker, as we do here.

I'm thinking we would need another thread that runs in the background and repeatedly polls the memory usage during preparation. At the end, the thread would return the peak observed memory usage.

Question: I'm guessing the allocated stat is what we need here, though it's not well-documented.
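
To make the idea concrete, here is a minimal sketch of such a polling thread, assuming the worker already uses tikv-jemallocator as its global allocator and that the allocated stat is indeed the right one; the stop-channel wiring and the polling interval are placeholders, not the actual implementation:

```rust
// Sketch only: polls jemalloc's "allocated" stat in a background thread and
// reports the peak observed value. Assumes tikv-jemallocator is the global
// allocator; the channel plumbing and 10ms interval are placeholders.
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

use tikv_jemalloc_ctl::{epoch, stats};

/// Spawns a thread that samples allocated bytes until `stop_rx` fires (or the
/// sender is dropped), then returns the peak value observed, in bytes.
fn spawn_memory_tracker(stop_rx: Receiver<()>) -> thread::JoinHandle<u64> {
    thread::spawn(move || {
        let mut peak: u64 = 0;
        loop {
            // Jemalloc stats are cached; advancing the epoch refreshes them.
            let _ = epoch::advance();
            if let Ok(allocated) = stats::allocated::read() {
                peak = peak.max(allocated as u64);
            }
            match stop_rx.recv_timeout(Duration::from_millis(10)) {
                // Preparation finished (or the sender was dropped): report the peak.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => return peak,
                Err(RecvTimeoutError::Timeout) => {}
            }
        }
    })
}
```

The preparation code would create the channel before compiling, send the stop signal once compilation finishes, join the thread, and include the returned peak in the result sent back to the host, which can then compare it against the threshold.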

Memory threshold

Referring to the quote from the overview:

not accepting a PVF where the preparation consumed more than X amount of memory. Where X is low enough that it should work on any validator which adheres to spec recommendations.

Question: What should we use as the value of X? What are the spec recommendations?

mrcnski self-assigned this Dec 22, 2022

mrcnski commented Dec 22, 2022

@eskimor Just to clarify something: this doesn't address OOM due to execution (i.e. AmbiguousWorkerDeath), right?

We do retry now on OOM errors, but that does not really resolve the issue. In any case, this should be resolved via changes to pre-checking.

This kind of implies OOM due to execution, as we only retry on AmbiguousWorkerDeath, but the proposed mitigation is only for OOM due to preparation, unless I'm missing something. At first I thought that this would supersede #767.


mrcnski commented Jan 12, 2023

What are the spec recommendations?

I found them here.

For the value of X in the ticket (threshold memory usage), does it make sense to set it really liberally, and get metrics on real usage in the meantime?


@eskimor After the discussion in the other issue, it seems to me that the strategy as described here is still viable. The extra thread would only be needed during pre-checking, and (I think) we would get more accurate statistics than from max_rss. Do you think I can get started soon, or is more discussion needed?


eskimor commented Jan 13, 2023

I don't think much more discussion is needed. We can definitely start gathering data and start with a high limit. If we limit max_rss during pre-checking to a value low enough that OOM is virtually impossible on any validator adhering to the spec recommendations, then I believe all issues are resolved.

So in summary:

  1. We gather data on actual memory usage.
  2. We limit max_rss during pre-checking to something significantly lower than the memory that should be available - there should be a margin of at least a factor of 10 (a minimal version of this check is sketched below).
  3. If preparation still runs out of memory later, we can then be very sure that it is a transient issue and retrying will very likely resolve it.
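
Concretely, the limiting step could boil down to a check like the following once preparation finishes during pre-checking; the limit value, type and function names below are placeholders, and the real value of X is still the open question above:

```rust
// Illustrative only: the limit value and names are placeholders, not the
// node's actual configuration or error types.
const PREPARATION_MEMORY_LIMIT_BYTES: u64 = 2 * 1024 * 1024 * 1024; // hypothetical X

/// Hypothetical pre-checking outcome based on the measured peak memory usage.
enum PrecheckMemoryOutcome {
    /// Peak usage stayed under the limit; the PVF can be accepted.
    WithinLimit,
    /// Peak usage exceeded the limit; vote against the PVF.
    ExceededLimit { peak_bytes: u64 },
}

fn check_preparation_memory(peak_bytes: u64) -> PrecheckMemoryOutcome {
    if peak_bytes > PREPARATION_MEMORY_LIMIT_BYTES {
        PrecheckMemoryOutcome::ExceededLimit { peak_bytes }
    } else {
        PrecheckMemoryOutcome::WithinLimit
    }
}
```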


eskimor commented Jan 13, 2023

Ok, sorry, the actual measurement of used memory is still not clear.

The two most promising options so far are reading RSS usage (which should reflect physical RAM usage) and tracking actual allocations. Neither is perfect:

Measuring RSS

Measuring RSS is not ideal if nodes have swap enabled, as they would then be allowed to use more "memory" than nodes without swap. The measurement will also be more platform dependent. The risk: nodes have swap enabled and are under memory pressure, so they are indeed swapping, and would hence allow more allocation than the limit is supposed to permit.

I am having trouble believing that this will ever be a problem in practice, though. It would mean that a significant number of validators:

  1. have swap enabled
  2. are under extreme memory pressure

Memory pressure also has to be pretty high, because I expect the preparation process to be quite short-lived, so it is unlikely to get swapped out. Especially if we have a rather low limit, it is even more likely that the limit will be hit before swapping can occur.
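
For reference, reading max_rss from the worker does not even need a polling thread, since the kernel already tracks the peak. A minimal sketch using getrusage via the libc crate, assuming a Unix worker (not the actual implementation):

```rust
// Sketch of reading peak RSS for the current (worker) process via getrusage.
// Note: on Linux ru_maxrss is reported in kilobytes, on macOS in bytes.
fn peak_rss_bytes() -> Option<u64> {
    // SAFETY: getrusage only writes into the struct we pass to it.
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut usage) };
    if ret == 0 {
        // Assuming Linux here: convert KiB to bytes.
        Some(usage.ru_maxrss as u64 * 1024)
    } else {
        None
    }
}
```

Since ru_maxrss is the lifetime peak of the whole process, a single call at the end of preparation is enough, but it also includes memory used before preparation started.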

Tracking allocations/free

This sounds great and would avoid the swap problem, but it suffers from a different issue: allocations reserve purely virtual memory, and whether those pages become physical memory depends on the actual usage.

The delta is very application dependent. For many applications it will likely be negligible, especially because the granularity is per page: it does not matter that you allocate a vec of 100 bytes and only use ten. The effect only becomes noticeable for memory regions in the ballpark of whole pages, e.g. a vec of 3k of which you only end up using 1k - those cases will make a difference.

Something else to consider: the usage depends on the compiler code and cannot be influenced by an adversary; and even if an attacker managed to make the compiler allocate lots of virtual-only memory, it would only get its own code rejected, so there is no value in that. What would be more troublesome: if honest code hits an edge case that causes the compiler to allocate lots of virtual memory, it would get rejected for no good reason.

Summary and Solution

Tracking RSS might underestimate usage if swap is used. Tracking allocations might overestimate it if lots of memory gets allocated but not fully used.

Overestimation sounds safer: if actual usage by the compiler does not lead to too much virtual-only allocation, this might be fine! But it also might not be; depending on the actual implementation, the overestimate could be too high to be useful.

Moving forward: let's get data! I would implement measurement for at least one of the two options, or, if easy enough, even both, and get metrics for them. Given that neither metric is perfect, having both seems like a good idea to me as a way to sanity-check either measurement.

The one delivering more sensible results can then be used for limiting, but I would still keep the other measurement around. If we ever have problems with the limit, we would then have another data point that could help a great deal in figuring out what is wrong.
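
As a sketch of the "get metrics for them" part, both measurements could be exported as Prometheus gauges; the metric names and registry wiring below are assumptions, not the node's existing telemetry setup:

```rust
// Illustrative only: registers two gauges for the two candidate measurements.
use prometheus::{Gauge, Registry};

fn register_preparation_memory_metrics(
    registry: &Registry,
) -> prometheus::Result<(Gauge, Gauge)> {
    let peak_allocated = Gauge::new(
        "pvf_preparation_peak_allocated_bytes",
        "Peak jemalloc-allocated bytes observed during PVF preparation",
    )?;
    let max_rss = Gauge::new(
        "pvf_preparation_max_rss_bytes",
        "Peak resident set size of the preparation worker (getrusage)",
    )?;
    registry.register(Box::new(peak_allocated.clone()))?;
    registry.register(Box::new(max_rss.clone()))?;
    Ok((peak_allocated, max_rss))
}
```

After each preparation job, the host would set both gauges (e.g. gauge.set(peak_bytes as f64)), so the two measurements can be compared over time before picking which one to enforce.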
