fix(l0_flush): drops permit before fsync, potential cause for OOMs #8327

problame · 2024-07-09T16:24:30Z

Problem

Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1720511577862519

We're seeing OOMs in staging on a pageserver that has l0_flush.mode=Direct enabled.

There's a strong correlation between jumps in maxrss_kb and pageserver_timeline_ephemeral_bytes, so it's quite likely that l0_flush.mode=Direct is the culprit.

Notably, the expected maximum memory usage of l0_flush.mode=Direct on that staging server is ~2GiB, but we're seeing as much as 24GiB max RSS before the OOM kill.

One hypothesis is that we're dropping the semaphore permit before all the dirtied pages have been flushed to disk. (The flushing to disk likely happens in the fsync inside the .finish() call, because we're using ext4 in data=ordered mode.)

Summary of changes

Hold the permit until after we're done with .finish().
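
To illustrate the ordering change, here is a minimal, std-only sketch. It is not the pageserver code: `Semaphore`, `Permit`, `Gauge`, `flush`, and `run` are hypothetical stand-ins, the sleeps only simulate the write and the fsync inside `.finish()`, and the real implementation is async. The sketch counts how many flushes hold unflushed data at once. When the permit is held until after the simulated finish, that count cannot exceed the semaphore limit; when it is dropped earlier, nothing enforces the bound.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical counting semaphore; a stand-in, not the pageserver's type.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

/// RAII permit: returns itself to the semaphore when dropped.
struct Permit<'a>(&'a Semaphore);

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }

    fn acquire(&self) -> Permit<'_> {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
        Permit(self)
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.0.permits.lock().unwrap() += 1;
        self.0.cv.notify_one();
    }
}

/// Counts flushes that currently hold unflushed data, and the peak seen.
#[derive(Default)]
struct Gauge {
    current: AtomicUsize,
    peak: AtomicUsize,
}

/// Simulated L0 flush. `hold_through_finish` selects the ordering:
/// false = permit dropped before the simulated fsync (the suspected bug),
/// true  = permit held until the simulated fsync is done (the fix).
fn flush(sem: &Semaphore, gauge: &Gauge, hold_through_finish: bool) {
    let permit = sem.acquire();
    let now = gauge.current.fetch_add(1, Ordering::SeqCst) + 1;
    gauge.peak.fetch_max(now, Ordering::SeqCst);
    thread::sleep(Duration::from_millis(2)); // simulated write, dirties pages

    if hold_through_finish {
        thread::sleep(Duration::from_millis(10)); // simulated .finish() / fsync
        gauge.current.fetch_sub(1, Ordering::SeqCst);
        drop(permit); // released only after the data is on disk
    } else {
        drop(permit); // released while dirty pages are still in memory
        thread::sleep(Duration::from_millis(10)); // simulated .finish() / fsync
        gauge.current.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Runs `tasks` concurrent flushes and returns the peak number in flight.
fn run(limit: usize, tasks: usize, hold_through_finish: bool) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let gauge = Arc::new(Gauge::default());
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let (sem, gauge) = (Arc::clone(&sem), Arc::clone(&gauge));
            thread::spawn(move || flush(&sem, &gauge, hold_through_finish))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    gauge.peak.load(Ordering::SeqCst)
}

fn main() {
    println!("peak, permit dropped before finish: {}", run(2, 8, false));
    println!("peak, permit held through finish:   {}", run(2, 8, true));
}
```

In the held-through-finish ordering, the gauge is incremented after acquiring the permit and decremented before releasing it, so the peak is bounded by the limit by construction. The early-drop ordering has no such guarantee, and its observed peak depends on thread timing.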

github-actions · 2024-07-09T17:18:38Z

3042 tests run: 2927 passed, 0 failed, 115 skipped (full report)

Flaky tests (3)

Postgres 16

test_tenant_creation_fails: debug

Postgres 15

test_subscriber_restart: release

Postgres 14

test_subscriber_restart: release

Code coverage* (full report)

functions: 32.6% (6940 of 21281 functions)
lines: 50.0% (54561 of 109077 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
82057ca at 2024-07-09T17:18:37.651Z

fix(l0_flush): drops permit before fsync, potential cause for OOMs

82057ca

problame requested a review from jcsp July 9, 2024 16:24

problame requested a review from a team as a code owner July 9, 2024 16:24

problame mentioned this pull request Jul 9, 2024

bypass PageCache for L0 flush #7418

Closed

jcsp approved these changes Jul 9, 2024

problame enabled auto-merge (squash) July 9, 2024 16:38

problame merged commit 9bb16c8 into main Jul 9, 2024
65 checks passed

problame deleted the problame/l0flush-extend-permit-lifetime branch July 9, 2024 18:58
