
pvf: Log memory metrics from preparation #6565

Merged · 16 commits · Feb 6, 2023

Conversation

@mrcnski (Contributor) commented Jan 16, 2023

PULL REQUEST

Overview

This is a first step toward mitigating disputes caused by OOM errors. Eventually, we would like to reject PVFs that surpass some memory threshold during compilation (preparation), while still in the pre-checking stage. We are not sure what the threshold should be at this time, so we are just gathering data for now.

In particular, there are three measurements that seem promising:

  • max_rss (resident set size) from getrusage
  • resident memory stat provided by jemalloc
  • allocated memory stat also from jemalloc.

All have pros and cons, described in detail in the related issues; a rough sketch of how they can be read follows below.
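For orientation, here is a rough sketch of reading these three numbers in Rust; it assumes the `libc` and `tikv-jemalloc-ctl` crates and is not the exact code in this PR.

// Sketch only: reading the three candidate metrics. Assumes the `libc` and
// `tikv-jemalloc-ctl` crates; the PR structures this differently.
use tikv_jemalloc_ctl::{epoch, stats};

fn read_memory_metrics() -> Result<(i64, usize, usize), String> {
    // 1. max_rss from getrusage(2). RUSAGE_SELF covers the whole process;
    // the discussion below settles on the Linux-only RUSAGE_THREAD instead.
    let mut rusage: libc::rusage = unsafe { std::mem::zeroed() };
    if unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut rusage) } != 0 {
        return Err("getrusage failed".to_string());
    }
    let max_rss = rusage.ru_maxrss as i64; // KiB on Linux, bytes on macOS

    // 2./3. jemalloc's resident and allocated stats. Many jemalloc stats are
    // cached and only refreshed when the epoch is advanced.
    epoch::advance().map_err(|e| e.to_string())?;
    let resident = stats::resident::read().map_err(|e| e.to_string())?;
    let allocated = stats::allocated::read().map_err(|e| e.to_string())?;

    Ok((max_rss, resident, allocated))
}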

See paritytech/polkadot-sdk#745 for more background, particularly this comment. Also see this issue starting at this comment.

Related issues

Closes #6317
See paritytech/polkadot-sdk#745
More background: https://github.com/paritytech/srlabs_findings/issues/110#issuecomment-1362822432

TODO

  • I need a little bit of help testing this. Is it possible to inspect the Metrics struct to see what has been logged? Mainly I'm interested to know whether this is a good approach, and whether there are existing tests that do something like this. (A testing sketch follows below.)
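One possible way to assert on gathered metrics in a test, sketched directly against the `prometheus` crate; the metric name and value are made up for illustration, and the host's actual Metrics wrapper differs in detail.

// Hypothetical test: register a histogram, record one observation, then
// inspect the registry. Name and value are illustrative assumptions.
use prometheus::{Histogram, HistogramOpts, Registry};

#[test]
fn preparation_memory_metric_is_observed() {
    let registry = Registry::new();
    let histogram = Histogram::with_opts(HistogramOpts::new(
        "pvf_preparation_max_resident", // illustrative name
        "Peak resident memory during PVF preparation (KiB)",
    ))
    .unwrap();
    registry.register(Box::new(histogram.clone())).unwrap();

    // Simulate the host recording a measurement after a preparation job.
    histogram.observe(92_000.0);

    // Gather from the registry and assert that the sample was recorded.
    let family = registry
        .gather()
        .into_iter()
        .find(|f| f.get_name() == "pvf_preparation_max_resident")
        .unwrap();
    assert_eq!(family.get_metric()[0].get_histogram().get_sample_count(), 1);
}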

@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. labels Jan 16, 2023
@mrcnski mrcnski force-pushed the mrcnski/prechecking-threshold-memory branch from c327b91 to 61ebe72 Compare January 16, 2023 23:31
@@ -845,7 +851,7 @@ mod tests {
 	let pulse = pulse_every(Duration::from_millis(100));
 	futures::pin_mut!(pulse);

-	for _ in 0usize..5usize {
+	for _ in 0..5 {
Contributor Author:

Sorry for the unrelated change, but this really triggered me.

//! - `allocated` memory stat also from `tikv-jemalloc-ctl`.
//!
//! Currently we are only logging these, and only on each successful pre-check. In the future, we
//! may use these stats to reject PVFs during pre-checking. See
Member:

Actually, since the goal in this initial phase is gathering data, we might extend this measurement to all preparation jobs for the time being. Reason: otherwise we only get data on newly registered PVFs, meaning that if parachains are stable (no upgrades), it could take a while until we've gathered enough data.

Contributor Author:

Ooh good point.

//
// 2. To have less potential loss of precision when converting to `f64`. (These values are
// originally `usize`, which is 64 bits on 64-bit platforms).
let resident_kb = (tracker_stats.resident / 1000) as f64;
Member:

I don't get the issue with precision loss. Floating point does not really care about the scale, as long as it fits in the exponent.

Also, to double-check: is it really kilo as in 1000, or is it 1024?

Contributor Author:

Oh yeah, it should be 1024. And I'll just remove the comment to avoid confusion.
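The corrected conversion would then presumably read:

// KiB rather than kB: divide by 1024; the precision comment was dropped.
let resident_kb = (tracker_stats.resident / 1024) as f64;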

@mrcnski (Contributor, Author) commented Jan 26, 2023

By using jemalloc for the memory stats, we will cause this user's builds to break again. I don't think we should delay this PR for that reason, but IMO we should start thinking about gating jemalloc behind a feature flag (as proposed on that issue), especially now that we are expanding its usage.

@mrcnski mrcnski requested a review from eskimor January 26, 2023 13:13
@eskimor (Member) commented Jan 26, 2023

  • I need a little bit of help testing this. Is it possible to inspect the Metrics struct to see what has been logged? Mainly I'm interested to know if this is a good approach, and if there are existing tests that do something like this.

You should be able to gather metrics by running it via zombienet.

@eskimor (Member) commented Jan 26, 2023

Closes paritytech/polkadot-sdk#745

That does not seem to be quite right. (Just a step towards)

/// For simplicity, any errors are returned as a string. As this is not a critical component, errors
/// are used for informational purposes (logging) only.
pub fn memory_tracker_loop(finished_rx: Receiver<()>) -> Result<MemoryAllocationStats, String> {
const POLL_INTERVAL: Duration = Duration::from_millis(10);
Contributor:

Do we really need to sample the stats so often?

Contributor Author (@mrcnski, Jan 26, 2023):

Oh, probably not. I was thinking of execution, where jobs take 10-25 ms.

Maybe 500ms for preparation is fine? I suspect that in most cases the memory will just grow, and the final measurement will be the max, so polling too often wouldn't buy much.

I'm wondering now if an attacker could craft a PVF that causes compiler memory to spike very quickly and then go back down. A coarse interval wouldn't catch that. Sounds like a very specific and unlikely attack though, and anyway the max_rss metric would be a useful backup stat for that case.

Member:

That is definitely a concern. It should not be possible to bypass the pre-checking checks.

Contributor Author:

You mean that kind of attack @eskimor? Maybe the interval could be randomized to be less predictable and thus less gameable?

Member:

I am dubious that could work: the sampling interval will likely stay in the milliseconds range, but allocating huge amounts of memory can be accomplished much faster, so an attacker would still be able to trick this. What we could do on top is track the overall amount of memory ever allocated; if that value changed "too much" between two samples, we could also mark the PVF as invalid.

Anyhow, we are talking about PVF preparation here, so this is about crafting a PVF that makes the compiler use huge amounts of memory for only a very short while. Given that the compiler code is not controlled by an attacker, this might not even be possible.

Contributor Author:

if that value changed "too much" during two samples, we could also mark the PVF as invalid.

Interesting idea. Why not just check max_rss at the end -- it should give us the maximum memory spike as well. If that's too high, we reject, even if the memory tracker observed metrics in the normal range.

Given that the compiler code is not controlled by an attacker, this might not even be possible.

Yep, I wondered the same thing above. Certainly for the purposes of gathering metrics right now it's not a concern. And later, maybe we can just rely on max_rss, if we find out it's not a useless stat.
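For concreteness, a polling tracker along the lines discussed above might look like the following sketch, which assumes std mpsc channels and the `tikv-jemalloc-ctl` crate rather than reproducing the code under review.

// Sketch of a polling memory tracker: sample jemalloc stats every
// POLL_INTERVAL, keep running maximums, and stop when the job finishes.
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;
use tikv_jemalloc_ctl::{epoch, stats};

#[derive(Default)]
pub struct MemoryAllocationStats {
    pub resident: u64,
    pub allocated: u64,
}

pub fn memory_tracker_loop(finished_rx: Receiver<()>) -> Result<MemoryAllocationStats, String> {
    const POLL_INTERVAL: Duration = Duration::from_millis(10);
    let mut max = MemoryAllocationStats::default();
    loop {
        // Refresh and sample the stats, keeping the maximum of each.
        epoch::advance().map_err(|e| e.to_string())?;
        let resident = stats::resident::read().map_err(|e| e.to_string())? as u64;
        let allocated = stats::allocated::read().map_err(|e| e.to_string())? as u64;
        max.resident = max.resident.max(resident);
        max.allocated = max.allocated.max(allocated);

        // Wait out the poll interval; return once the preparation job
        // signals completion (or the sender is dropped).
        match finished_rx.recv_timeout(POLL_INTERVAL) {
            Err(RecvTimeoutError::Timeout) => continue,
            Ok(()) | Err(RecvTimeoutError::Disconnected) => return Ok(max),
        }
    }
}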

Contributor:

I think a better way to handle this would be to not poll and instead set up a cgroup with an appropriate limit and subscribe to an event once that limit is breached.

Contributor Author:

Indeed, thanks! I'll extract that suggestion into a follow-up issue. I'll keep the polling for now, for the purposes of gathering metrics, as we don't yet know what the limit should be.
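A sketch of the cgroup idea, assuming cgroups v2 mounted at /sys/fs/cgroup and a pre-created group for the preparation job; the paths and the limit are illustrative assumptions.

// Sketch: cap the group's memory via memory.max, then check memory.events
// for OOM kills after the job. An event-driven version would watch
// memory.events with inotify instead of reading it once.
use std::fs;

fn set_memory_limit(cgroup_dir: &str, limit_bytes: u64) -> std::io::Result<()> {
    // On breach, the kernel reclaims and, failing that, OOM-kills a member.
    fs::write(format!("{cgroup_dir}/memory.max"), limit_bytes.to_string())
}

fn oom_kill_count(cgroup_dir: &str) -> std::io::Result<u64> {
    // memory.events contains lines such as "oom 0" and "oom_kill 0".
    let events = fs::read_to_string(format!("{cgroup_dir}/memory.events"))?;
    Ok(events
        .lines()
        .find_map(|line| line.strip_prefix("oom_kill "))
        .and_then(|n| n.trim().parse().ok())
        .unwrap_or(0))
}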

@eskimor (Member) commented Jan 27, 2023

Let's run this on Versi and then merge.

@mrcnski (Contributor, Author) commented Feb 2, 2023

I found out that max_rss does actually work on macOS, and I was able to test it locally with zombienet and got a metric:

https://pastebin.com/raw/CH41UfHt

However, I had to use RUSAGE_SELF, as RUSAGE_THREAD is not supported there. Anyway, I'll be pushing a change enabling this metric on Linux and macOS but not Windows.

@mrcnski (Contributor, Author) commented Feb 2, 2023

On Versi the metrics are:

  • max_allocated: ~80 MB
  • max_resident: ~90 MB
  • max_rss: ~1 GB (!)

@koute (Contributor) commented Feb 3, 2023

On Versi the metrics are:

  • max_allocated: ~80 MB
  • max_resident: ~90 MB
  • max_rss: ~1 GB (!)

All of these seem kind of low, but I guess if this is only for the PVF precheck then it could make sense. And RSS being significantly higher than metrics returned by jemalloc is perfectly normal.

Would be interesting to do some more detailed profiling of this if/when I find the time.

@mrcnski (Contributor, Author) commented Feb 6, 2023

I found out that max_rss does actually work on MacOS [...] However, I had to use RUSAGE_SELF, as RUSAGE_THREAD is not supported.

Oh, I realized that RUSAGE_SELF can't work, because the same process is responsible for multiple jobs, and I couldn't find any way to "reset" the max_rss between jobs. (Edit: I found this but again, it's Linux-only.) So I'll revert this change again (I guess we just can't support macOS), run a quick zombienet test for sanity, and then merge.
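The resulting platform gating might look roughly like this sketch (illustrative, not the PR's exact code; RUSAGE_THREAD is Linux-only, and RUSAGE_SELF covers the whole multi-job process):

// Sketch: per-thread max_rss on Linux only; elsewhere the metric is skipped.
#[cfg(target_os = "linux")]
fn thread_max_rss() -> Option<i64> {
    let mut rusage: libc::rusage = unsafe { std::mem::zeroed() };
    let ret = unsafe { libc::getrusage(libc::RUSAGE_THREAD, &mut rusage) };
    (ret == 0).then(|| rusage.ru_maxrss as i64)
}

#[cfg(not(target_os = "linux"))]
fn thread_max_rss() -> Option<i64> {
    // No per-thread max_rss available on this platform.
    None
}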

@mrcnski changed the title from "pvf: Log memory metrics from prechecking" to "pvf: Log memory metrics from preparation" on Feb 6, 2023
@mrcnski (Contributor, Author) commented Feb 6, 2023

bot merge
