Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rate limiter priority to ReadOptions #9424

Closed
wants to merge 33 commits into from

Conversation

ajkr
Copy link
Contributor

@ajkr ajkr commented Jan 24, 2022

Users can set the priority for file reads associated with their operation by setting ReadOptions::rate_limiter_priority to something other than Env::IO_TOTAL. Rate limiting VerifyChecksum() and VerifyFileChecksums() is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.

RandomAccessFileReader::Read() already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in Env::IOPriority. Now the compaction read rate limiting is supported by setting rate_limiter_priority = Env::IO_LOW on its ReadOptions.

There is no default value for the new Env::IOPriority parameter to RandomAccessFileReader::Read(). That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a ReadOptions to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).

The API doc only lists the missing cases where a file read associated with a provided ReadOptions cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide ReadOptions or Env::IOPriority, so I didn't count that as missing.

Test Plan:

  • new unit tests
  • new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
    • setup command: ./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true
    • benchmarks command: strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true
  • crash test using IO_USER priority on non-validation reads with Refactor FilterPolicies toward Customizable #9567 reverted: python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10

@ajkr ajkr force-pushed the rate-limiter-priority-in-read-options branch from 9282547 to c4bfc3d Compare January 24, 2022 19:24
@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@ajkr ajkr force-pushed the rate-limiter-priority-in-read-options branch from 4c124fc to bf60a5c Compare January 25, 2022 08:40
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ajkr ajkr force-pushed the rate-limiter-priority-in-read-options branch from bf60a5c to 7f592b3 Compare February 1, 2022 22:32
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

db/db_test2.cc Outdated Show resolved Hide resolved
include/rocksdb/options.h Outdated Show resolved Hide resolved
include/rocksdb/options.h Outdated Show resolved Hide resolved
include/rocksdb/options.h Outdated Show resolved Hide resolved
Copy link
Contributor

@hx235 hx235 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early review: I looked at the code change/testing for compaction and VerifyChecksum and believe it's sufficient for delivery 👍 with a few nits.

Will look at other read-related changes in the second round. If there is a way to split them into two PRs, I am good with approving the compaction/Verifychecksum related one.

One thing though is we might want to work on a better API doc for both DBOptions::rate_limiter and ReadOptions::priority since (1) we expand the scope of things being rate limited by DBOptions::rate_limiter (2) we open a way to bypass that rate limiting using ReadOptions::priority (3) some lower-level operation will ignore ReadOptions::priority no matter what.

I'm thinking of the following comment:
(a) For DBOptions::rate_limiter API:

If not null, bytes_per_sync is set to 1MB by default and it will be used to rate-limit the following (the list is subjected to future expansion):
Read:
- Compaction read (at IOPriority::LOW)
- DB::VerifyChecksum(at specified priority or be bypassed, see ReadOptions::priority for more)
Write:
- Compaction write (at IOPriority::LOW)
- Flush write (at IOPriority::HIGH)

Default: nullptr (rate limiting disabled)

(b) ReadOptions::rate_limiter: see PR comment

Copy link
Contributor Author

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Early review: I looked at the code change/testing for compaction and VerifyChecksum and believe it's sufficient for delivery 👍 with a few nits.

Will look at other read-related changes in the second round. If there is a way to split them into two PRs, I am good with approving the compaction/Verifychecksum related one.

Sorry the main idea of this PR requires them to be quite entangled (see below).

One thing though is we might want to work on a better API doc for both DBOptions::rate_limiter and ReadOptions::priority since (1) we expand the scope of things being rate limited by DBOptions::rate_limiter (2) we open a way to bypass that rate limiting using ReadOptions::priority (3) some lower-level operation will ignore ReadOptions::priority no matter what.

I'm thinking of the following comment: (a) For DBOptions::rate_limiter API:

If not null, bytes_per_sync is set to 1MB by default and it will be used to rate-limit the following (the list is subjected to future expansion):
Read:
- Compaction read (at IOPriority::LOW)
- DB::VerifyChecksum(at specified priority or be bypassed, see ReadOptions::priority for more)
Write:
- Compaction write (at IOPriority::LOW)
- Flush write (at IOPriority::HIGH)

Default: nullptr (rate limiting disabled)

(b) ReadOptions::rate_limiter: see PR comment

The goal of this PR is to generally support ReadOptions::priority besides in unimportant cases and a handful of uncommon cases.

It is easier to describe where support is missing than where support is available. Describing either one should be acceptable, IMO, since it's the complement of the other. I am happy to reexamine the list of missing support, both for making sure it is at the proper level of abstraction (for example, the reference to MultiRead() is too low-level) and making sure it is complete (for example, I have not checked for APIs accepting ReadOptions that do their reads outside RandomAccessFileReader).

It is difficult to list available support because this PR made changes at a low level to support most user APIs (without needing to look at what exactly they are) at once. For example, a lot of APIs (Get(), iterator, compaction, etc.) already passed their ReadOptions all the way down to BlockFetcher. I only made a small code change to BlockFetcher to make rate limiting available to all those APIs. That is also why this approach requires we do it all at once -- if we wanted to only support compaction/VerifyChecksum, we'd have to go into BlockFetcher and explicitly remove support for all other cases, which seems like extra code for an unnecessary limitation.

Meanwhile, it was actually fairly easy to identify the user API corresponding to missing support. I guess it's because that case is (luckily) much less common, where the user can provide ReadOptions that are inaccessible to RandomAccessFileReader.

@hx235
Copy link
Contributor

hx235 commented Feb 2, 2022

Aha - I could see that now. Read operations that pass ReadOptions down to RandomAccessFileReader can now choose to have rate limiting support, the list of which is pretty big compared with the counterpart. The list will be further expanded once we cover SequentialFileReader.

I am happy to reexamine the list of missing support, both for making sure it is at the proper level of abstraction (for example, the reference to MultiRead() is too low-level) and making sure it is complete (for example, I have not checked for APIs accepting ReadOptions that do their reads outside RandomAccessFileReader).

This will be great. I'm good with listing unsupported user read API (such as DB::MultiGet(), old BlobDB reads, general reads under plain/cuckoo table configuration) in DBOptions::rate_limiter API doc so that users know what NOT to expect for rate limiting support. It will also be good to briefly mention there something like "see ReadOptions::priority for more info on specifying rate-limiting priority of each read or bypassing the rate limiting" so that users know how correctly add such rate-limiting support. For internal read/write of compaction and write of flush, contrary to user read API, their priorities are fixed and will not be bypassed using ReadOptions::priority, which is basically what the existing DBOptions::rate_limiter API describes.

I will also update the comment as getting SequentialFileReader and write-side of things done.

@ajkr ajkr force-pushed the rate-limiter-priority-in-read-options branch from 7f592b3 to 6439991 Compare February 2, 2022 20:30
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

3 similar comments
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

Copy link
Contributor Author

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be great. I'm good with listing unsupported user read API (such as DB::MultiGet(), old BlobDB reads, general reads under plain/cuckoo table configuration) in DBOptions::rate_limiter API doc so that users know what NOT to expect for rate limiting support. It will also be good to briefly mention there something like "see ReadOptions::priority for more info on specifying rate-limiting priority of each read or bypassing the rate limiting" so that users know how correctly add such rate-limiting support.

I put the list in one place (ReadOptions::priority) and referred to it from the DBOptions::rate_limiter. The list of scenarios that ignore/do not support ReadOptions::priority is shorter than the list of scenarios where read rate limiting is unavailable. Scenarios like reading MANIFEST/WALs during DB::Open() do not support read rate limiting, but I do not think we need to explicitly list cases where the API offers no way to enable rate limiting (there is no ReadOptions passed to DB::Open(), for example).

Also done:

  • Went through all clients of SequentialFileReader and did not find any where the user-facing API offers a way to enable rate limiting.
  • Updated to specify which MultiGet() APIs are supported at the proper level of abstraction.
  • Removed mention of old BlobDB since it was never actually exposed in any public header.
  • Elaborated on "cuckoo/plain table" so people can identify whether they are relevant to their use case

For internal read/write of compaction and write of flush, contrary to user read API, their priorities are fixed and will not be bypassed using ReadOptions::priority, which is basically what the existing DBOptions::rate_limiter API describes.

Right. Keep in mind there is no way for a user to supply the ReadOptions for flush or compaction, so we don't really have a choice about bypassing ReadOptions::priority or not.

include/rocksdb/options.h Outdated Show resolved Hide resolved
@ajkr ajkr force-pushed the rate-limiter-priority-in-read-options branch from fa05d10 to f22a4d9 Compare February 2, 2022 23:31
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@hx235
Copy link
Contributor

hx235 commented Feb 2, 2022

The list of scenarios that ignore/do not support ReadOptions::priority is shorter than the list of scenarios where read rate limiting is unavailable. Scenarios like reading MANIFEST/WALs during DB::Open() do not support read rate limiting, but I do not think we need to explicitly list cases where the API offers no way to enable rate limiting (there is no ReadOptions passed to DB::Open(), for example).

Sounds good - thanks for clarification. Sounds like now we make ReadOptions a necessary condition to having rate limiting support for user read. Then we either take ReadOptions::priority into account (i.e, user specifying a priority or RocksDB internal sets IO_TOTAL as not-supported) or just ignore it.

Good with me! 👍

@hx235
Copy link
Contributor

hx235 commented Feb 3, 2022

Not a rush to finishing the PR but are we still going to change the name for ReadOptions::priority? Since it's a public API, it will be good to finalize the naming as early as possible.

Other changes in the update look good!

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@ajkr ajkr force-pushed the rate-limiter-priority-in-read-options branch from 61c7b6d to 10574a5 Compare February 17, 2022 04:08
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@@ -349,6 +349,8 @@ bool StressTest::VerifySecondaries() {
fprintf(stderr, "Secondary failed to catch up with primary\n");
return false;
}
// This `ReadOptions` is for validation purposes. Ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@hx235 hx235 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to add it to db_crashtest.py for automating the test?

Others LGTM!

Copy link
Contributor Author

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might worth adding an option FLAG_read_rate_limiter_priority to stress test considering it touches many aspects of our read. Hopefully it wouldn’t be hard as a quick search in db_stress_test_base reveals 11 ReadOptions.

Done. Although, keep in mind the crash test wrapper does not use rate limiter yet, so the support is a bit superficial for now. We will enable rate limiting eventually, but not in the near future (IngestExternalFile() is my next step to enable). I at least ran it manually and added the command to the test plan to show the new feature can work.

db/db_test2.cc Outdated Show resolved Hide resolved
db/db_rate_limiter_test.cc Show resolved Hide resolved
@ajkr
Copy link
Contributor Author

ajkr commented Feb 17, 2022

Do we also need to add it to db_crashtest.py for automating the test?

Sorry I wrote my last comment without refreshing. Basically yes, but it's non-trivial to enable it without slowing things down too much, so I hope to defer it.

edit: Another reason I'm seeing it's nontrivial as it runs is it's printing a lot of these lines:

***MultiGet error: Not implemented: Unable to rate limit MultiRead()***

@hx235
Copy link
Contributor

hx235 commented Feb 17, 2022

Do we also need to add it to db_crashtest.py for automating the test?

Sorry I wrote my last comment without refreshing. Basically yes, but it's non-trivial to enable it without slowing things down too much, so I hope to defer it.

Sure! I was actually just referring to something like "rate_limit_user_ops": lambda: random.choice([0, 1]) cuz I didn't realize "crash test wrapper does not use rate limiter yet" :p

@hx235
Copy link
Contributor

hx235 commented Feb 17, 2022

edit: Another reason I'm seeing it's nontrivial as it runs is it's printing a lot of these lines:

Got you! Might be good to just leave a comment //"rate_limit_user_ops": lambda: random.choice([0, 1]) in case we (or I) forget to add it later.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@ajkr
Copy link
Contributor Author

ajkr commented Feb 17, 2022

edit: Another reason I'm seeing it's nontrivial as it runs is it's printing a lot of these lines:

Got you! Might be good to just leave a comment //"rate_limit_user_ops": lambda: random.choice([0, 1]) in case we (or I) forget to add it later.

Not sure, I think the person to enable rate limiting in CI will need to do a tour of all the possibilities. There's a bunch of things I would do if we did it right now, like renaming rate_limit_bg_reads, making it support kAllIo (currently we only support reads or writes, never both), picking a rate_limiter_bytes_per_sec that doesn't slow things down, dynamically changing the rate limit, enabling IO_USER, etc. etc. But I guess whoever eventually does it will have to explore a bit to figure out the right way. I kind of don't think they'll miss the flags with "rate_limit" in the name :).

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235
Copy link
Contributor

hx235 commented Feb 17, 2022

edit: Another reason I'm seeing it's nontrivial as it runs is it's printing a lot of these lines:

Got you! Might be good to just leave a comment //"rate_limit_user_ops": lambda: random.choice([0, 1]) in case we (or I) forget to add it later.

Not sure, I think the person to enable rate limiting in CI will need to do a tour of all the possibilities. There's a bunch of things I would do if we did it right now, like renaming rate_limit_bg_reads, making it support kAllIo (currently we only support reads or writes, never both), picking a rate_limiter_bytes_per_sec that doesn't slow things down, dynamically changing the rate limit, enabling IO_USER, etc. etc. But I guess whoever eventually does it will have to explore a bit to figure out the right way. I kind of don't think they'll miss the flags with "rate_limit" in the name :).

Sounds good! I would likely check back on this comment if I happen to be that person to enable rate limiting in CI .....

facebook-github-bot pushed a commit that referenced this pull request Jun 17, 2022
Summary:
**Context/Summary:**
#9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.

However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.

Pull Request resolved: #10159

Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of  `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
   - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work

Reviewed By: ajkr, anand1976

Differential Revision: D37135172

Pulled By: hx235

fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
facebook-github-bot pushed a commit that referenced this pull request Aug 9, 2023
…+ misc (#11444)

Summary:
**Context/Summary:**
- Similar to #11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with #11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](#9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: #11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: google/benchmark@604f6fd
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants