Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: direct IO for Pageserver #8240

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from
Draft

RFC: direct IO for Pageserver #8240

wants to merge 10 commits into from

Conversation

problame
Copy link
Contributor

@problame problame commented Jul 2, 2024

Copy link

github-actions bot commented Jul 2, 2024

3042 tests run: 2925 passed, 2 failed, 115 skipped (full report)


Failures on Postgres 15

  • test_pageserver_metrics_removed_after_detach: debug

Failures on Postgres 14

  • test_pg_regress[4]: debug
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_pg_regress[debug-pg14-4] or test_pageserver_metrics_removed_after_detach[debug-pg15]"
Flaky tests (2)

Postgres 14

  • test_pg_regress[4]: debug
  • test_tenant_creation_fails: debug

Test coverage report is not available

The comment gets automatically updated with the latest test results
02ce39f at 2024-07-09T09:45:25.659Z :recycle:

2. all indirect blocks (=disk btree blocks) are cached in the PS `PageCache`.
The norm will be very low baseline replacement rates in PS `PageCache`.
High baseline replacement rates will be treated as a signal of resource exhaustion (page cache insufficient to host working set of the PS).
It will be remediated by the storage controller, migrating tenants away to relieve pressure.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth caveating this with a note that this RFC/project doesn't cover such migration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bulk of the design & coding work is to ensure adherence to the alignment requirements.

Our automed benchmarks are insufficient to rule out performance regressions.
Manual benchmarking / new automated benchmarks will be required for the last two items (new PS PageCache size, avoiding regressions).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's specify what these benchmarks should be, to avoid open-ended benchmarking work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

02ce39f

I'll leave this open because I can imagine this won't be sufficient for your taste, but ENOTIME at this point.

docs/rfcs/034-direct-io-for-pageserver.md Show resolved Hide resolved
docs/rfcs/034-direct-io-for-pageserver.md Outdated Show resolved Hide resolved
@problame problame changed the title RFC: direct IO for reads RFC: direct IO for Pageserver Jul 9, 2024

The **buffer pool** mentioned to above will be a load-bearing component.
Its basic function is to provide callers with a memory buffer of adequate alignment and size (statx `Dio_mem_align` / `Dio_offset_align`).
Callers `get()` a buffer from the pool. Size is specified at `get` time and is fixed (not growable).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the buffers are non-growable, but each buffer can have different size? I was thinking the buffer pool could give out buffers of the same size. Would this be easier to solve the alignment problem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing some details, but are the buffer pool memory buffers allocated using the regular memory allocator or does it require some special mechanism?

For example, a 10 byte sized read or write to offset 5000 in a file will load the file contents
at offset `[4096,8192)` into a free page in the kernel page cache. If necessary, it will evict
other pages to make room (cf eviction). Then, the kernel performs a memory-to-memory copy of 10 bytes
from/to the offset `4` (`5000 = 4096 + 4`) within the cached page. If it's a write, the kernel keeps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from/to the offset `4` (`5000 = 4096 + 4`) within the cached page. If it's a write, the kernel keeps
from/to the offset `904` (`5000 = 4096 + 904`) within the cached page. If it's a write, the kernel keeps

each buffer with all thread-local executors. However, above API requirements for the buffer pool implicitly require the buffer
handle that's returned by `get()` to be a custom smart pointer type. We will be able to extend it in the future to include the
io_uring registered buffer index without having to touch the entire code base.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the call I was thinking that even with current day apis alloc::alloc::alloc or GlobalAlloc::alloc we are able to construct at runtime alloc::alloc::Layout for properly aligned and sized buffers. With the likely move towards jemalloc, we might be able to get away with buffer pool == tokio::sync::Semaphore + jemalloc (initially?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this idea, jemalloc is probably not a requirement. Pretty sure all allocators have thread local pools.

Together with the work stealing executor it might be there this has outweighed risk of one thread ending up doing all of the allocations, growing it's thread local pool, but then again, could be that on average it actually works out.

Copy link
Contributor Author

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarifications / discussions:

  • Virtual File will not be used for buffered IO anymore

    • => tenant conf and similar should just use tokio::fs
    • Arpad: concerns about not using VirtualFile fd cache
  • RFC is unclear about whether buffer pool buffers are all the same size or not

    • Christian thinks we can get away with one single buffer / io size (8K)
    • No Plan B for if we cannot (except using jemalloc, see "option C" in "Discussion" section below)

Discussion:

  • Vlad: prefetch: do we currently rely on the kernel prefetch for perf?
    • compaction, esp the upcomping k-merge compaction, most likely does
    • Will it still be there with direct IO?
  • fixed buffer pool size? What if we run out of memory?
    • option A: "wait for free" mechanism has synchronization overhead and can deadlock
    • option B: "fail with error, ask client to retry with back-off"
      • this would probably be the correct approach for responding to overload
      • page_service protocol doesn't provide this option, though
    • option C: use jemalloc as the buffer pool impl, do not limit buffer pool size
      • use jemalloc for both regular allocations and buffer pool
      • PS DRAM consumption would then be: fixed PageCache + jemalloc
      • => this seems attractive because it punts on drawbacks of (A) and (B) and de-risks OOMs
  • Vlad: plans for new metrics such as queue depth?

@problame
Copy link
Contributor Author

cc #8543

Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because Direct IO would double the IOPS & latency for each compaction read (#8240).

problame added a commit that referenced this pull request Jul 31, 2024
part of #8184

# Problem

We want to bypass PS PageCache for all data block reads, but
`compact_level0_phase1` currently uses `ValueRef::load` to load the WAL
records from delta layers.
Internally, that maps to `FileBlockReader:read_blk` which hits the
PageCache
[here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236).

# Solution

This PR adds a mode for `compact_level0_phase1` that uses the
`MergeIterator` for reading the `Value`s from the delta layer files.

`MergeIterator` is a streaming k-merge that uses vectored blob_io under
the hood, which bypasses the PS PageCache for data blocks.

Other notable changes:
* change the `DiskBtreeReader::into_stream` to buffer the node, instead
of holding a `PageCache` `PageReadGuard`.
* Without this, we run out of page cache slots in
`test_pageserver_compaction_smoke`.
* Generally, `PageReadGuard`s aren't supposed to be held across await
points, so, this is a general bugfix.

# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being
developed as part of
* #8002

Therefore, this PR adds a validation mode that compares the existing
approach's value iterator with the new approach's stream output, item by
item.
If they're not identical, we log a warning / fail the unit/regression
test.
To avoid flooding the logs, we apply a global rate limit of once per 10
seconds.
In any case, we use the existing approach's value.

Expected performance impact that will be monitored in staging / nightly
benchmarks / eventually pre-prod:
* with validation:
  * increased CPU usage
  * ~doubled VirtualFile read bytes/second metric
* no change in disk IO usage because the kernel page cache will likely
have the pages buffered on the second read
* without validation:
* slightly higher DRAM usage because each iterator participating in the
k-merge has a dedicated buffer (as opposed to before, where compactions
would rely on the PS PageCaceh as a shared evicting buffer)
* less disk IO if previously there were repeat PageCache misses (likely
case on a busy production Pageserver)
* lower CPU usage: PageCache out of the picture, fewer syscalls are made
(vectored blob io batches reads)

# Rollout

The new code is used with validation mode enabled-by-default.
This gets us validation everywhere by default, specifically in
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that
configures prod to continue using the existing behavior:

* neondatabase/infra#1663

# Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because
Direct IO would double the IOPS & latency for each compaction read
(#8240).

# Future Work

The streaming k-merge's memory usage is proportional to the amount of
memory per participating layer.

But `compact_level0_phase1` still loads all keys into memory for
`all_keys_iter`.
Thus, it continues to have active memory usage proportional to the
number of keys involved in the compaction.

Future work should replace `all_keys_iter` with a streaming keys
iterator.
This PR has a draft in its first commit, which I later reverted because
it's not necessary to achieve the goal of this PR / issue #8184.
arpad-m pushed a commit that referenced this pull request Aug 5, 2024
part of #8184

# Problem

We want to bypass PS PageCache for all data block reads, but
`compact_level0_phase1` currently uses `ValueRef::load` to load the WAL
records from delta layers.
Internally, that maps to `FileBlockReader:read_blk` which hits the
PageCache
[here](https://github.com/neondatabase/neon/blob/e78341e1c220625d9bfa3f08632bd5cfb8e6a876/pageserver/src/tenant/block_io.rs#L229-L236).

# Solution

This PR adds a mode for `compact_level0_phase1` that uses the
`MergeIterator` for reading the `Value`s from the delta layer files.

`MergeIterator` is a streaming k-merge that uses vectored blob_io under
the hood, which bypasses the PS PageCache for data blocks.

Other notable changes:
* change the `DiskBtreeReader::into_stream` to buffer the node, instead
of holding a `PageCache` `PageReadGuard`.
* Without this, we run out of page cache slots in
`test_pageserver_compaction_smoke`.
* Generally, `PageReadGuard`s aren't supposed to be held across await
points, so, this is a general bugfix.

# Testing / Validation / Performance

`MergeIterator` has not yet been used in production; it's being
developed as part of
* #8002

Therefore, this PR adds a validation mode that compares the existing
approach's value iterator with the new approach's stream output, item by
item.
If they're not identical, we log a warning / fail the unit/regression
test.
To avoid flooding the logs, we apply a global rate limit of once per 10
seconds.
In any case, we use the existing approach's value.

Expected performance impact that will be monitored in staging / nightly
benchmarks / eventually pre-prod:
* with validation:
  * increased CPU usage
  * ~doubled VirtualFile read bytes/second metric
* no change in disk IO usage because the kernel page cache will likely
have the pages buffered on the second read
* without validation:
* slightly higher DRAM usage because each iterator participating in the
k-merge has a dedicated buffer (as opposed to before, where compactions
would rely on the PS PageCaceh as a shared evicting buffer)
* less disk IO if previously there were repeat PageCache misses (likely
case on a busy production Pageserver)
* lower CPU usage: PageCache out of the picture, fewer syscalls are made
(vectored blob io batches reads)

# Rollout

The new code is used with validation mode enabled-by-default.
This gets us validation everywhere by default, specifically in
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging

Before the next release, I'll merge the following aws.git PR that
configures prod to continue using the existing behavior:

* neondatabase/infra#1663

# Interactions With Other Features

This work & rollout should complete before Direct IO is enabled because
Direct IO would double the IOPS & latency for each compaction read
(#8240).

# Future Work

The streaming k-merge's memory usage is proportional to the amount of
memory per participating layer.

But `compact_level0_phase1` still loads all keys into memory for
`all_keys_iter`.
Thus, it continues to have active memory usage proportional to the
number of keys involved in the compaction.

Future work should replace `all_keys_iter` with a streaming keys
iterator.
This PR has a draft in its first commit, which I later reverted because
it's not necessary to achieve the goal of this PR / issue #8184.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants