-
Notifications
You must be signed in to change notification settings - Fork 1.4k
ShardedRocks determinism phase 2 (PSM, Checkpoint) #12203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
046d510
to
25efe40
Compare
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
…f CheckpointMetaData object
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
@@ -0,0 +1,145 @@ | |||
#include "fdbclient/StorageCheckpoint.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the copyright header?
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Description
This is part 2 of #11841. It took a few weeks of iteration of identifying different sources of non-determinism and then figuring out the simplest solution.
In terms of root cause behind the non-determinism, all offending instances were in the checkpoint metadata category. When we send this checkpoint over the network, flow transport relies on the byte size of the payload and the run would diverge if the payload size is even slightly different.
Generally, once the root cause is identified, the hard part is done, and the solution can be relatively straightforward using the principles described in #11790. As a reminder the goal is to ensure simulation behavior is as close to production as possible, at the same time, ensure the runs are deterministic in simulation. What made solving sharded rocks with PSM and checkpoint hard is two fold:
There are just lots of metadata fields that are fetched via rocksdb api. It's not scalable to go one field at a time and pick a sane deterministic value.
foundationdb/fdbserver/RocksDBCheckpointUtils.actor.cpp
Lines 77 to 103 in 046d510
The checkpoint goes through network transport layer, which is too low level to change the behavior of in order to get determinism.
So instead, this was the solution I converged towards:
First, limit the entry points of all sets/gets to checkpoint. This is simply done by having set/get APIs. What then this gives is that I can change the internal representation of the checkpoint at one place without letting any of the clients know. This is only done in simulation. Even in simulation, from a client's pov, you write X, you get X.
Once I had control over the checkpoint internal representation without affecting clients, then I can change the representation by adding dynamic padding to it. The dynamic here means we round the payload size to the next Nth byte size. For example, payload of 100 bytes -> padded to 5000 bytes, payload of 5001 bytes -> padded to 10000 bytes, etc. The N is configurable but what this does is makes payload size deterministic -- there's a probability here but more discussion on that in the FAQ section. Finally, there was a footer that encodes the exact payload size picked, so when someone requests it, we can extract the original content back.
FAQs
These are questions I was asking myself as I worked towards solving this issue.
Q1: Why not use the approach of picking sane deterministically random values for the metadata fields?
A1: As described above, there were too many fields, and then we go through the network layer which is too low level and does not have field-by-field granularity, only "payload" or packet level granularity.
Q2: Is there any functional change in production?
A2: No. In production, we exit early at the set/get level. Even in simulation, there should be no functional behavior change since client is not aware of the internal details of dynamic padding and flow transport layer.
Q3: Ok, in simulation, surely there has to be a cost. What is it?
A3: Yes, there is a cost. I would put all of that under "performance" category, which is not the goal of simulation. The fact that we store additional padding and that too inefficiently (it's optimized for readability and debugging) means we're using more memory and in fact more simulated network i/o. All of this is ok as long as performance degradations don't cause significant slow downs, OOM or crashes.
Q4: How do you pick PAYLOAD_ROUND_TO_NEXT, what are the tradeoffs, and how do you know it does not cause issues like OOM?
A4: Empirically via experimentation. The tradeoff is that lower the value, lower memory but higher probability of non-determinism. From my experiments, rocksdb metadata variation, atleast over the wire, is a few KBs, not 10s of KBs. I picked 5K as a result and so far have not seen any OOMs/timeouts/crashes. This can always be tuned as needed.
Testing
Added new unit tests.
Ran PhysicalShardMove.toml 500K times, saw 0 instances of UnseedMismatch:
20250620-211429-praza-shardedrocks-determinism-phase2-e54019 compressed=True data_size=37114014 duration=19311177 ended=500000 fail=8 fail_fast=25 max_runs=500000 pass=499992 priority=100 remaining=0 runtime=20:11:28 sanity=False started=500000 stopped=20250621-172557 submitted=20250620-211429 timeout=5400 username=praza-shardedrocks-determinism-phase2-e54019c6b7b1452581c954dc4c8cb57faa2aedd9
Ran general 500K, saw 0 instances of UnseedMismatch:
20250620-211406-praza-shardedrocks-determinism-phase2-e54019 compressed=True data_size=37084541 duration=24685358 ended=500000 fail=6 fail_fast=25 max_runs=500000 pass=499994 priority=100 remaining=0 runtime=23:02:25 sanity=False started=500000 stopped=20250621-201631 submitted=20250620-211406 timeout=5400 username=praza-shardedrocks-determinism-phase2-e54019c6b7b1452581c954dc4c8cb57faa2aedd9
Ran general 500K while forcing sharded rocks, saw 0 instances of UnseedMismatch:
20250620-212116-praza-shardedrocks-determinism-phase2-forceS compressed=True data_size=37084540 duration=26423186 ended=500000 fail=8 fail_fast=25 max_runs=500000 pass=499992 priority=100 remaining=0 runtime=23:10:01 sanity=False started=500000 stopped=20250621-203117 submitted=20250620-212116 timeout=5400 username=praza-shardedrocks-determinism-phase2-forceSR-8745a0d9340d3a2c6f2c0c7beea251dd33e81ce8
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch
ormain
if this is the youngest branch)