logstore: sync sideloaded storage directories #114616
Conversation
This commit removes the dirCreated field of DiskSideloadStorage, because it is only used in tests, and is redundant (the directory existence check already does the job). Epic: none Release note: none
Recommend reviewing commit-by-commit.
```go
			return err
		}
		continue
	}
}

// Sync implements SideloadStorage.
func (ss *DiskSideloadStorage) Sync() error {
```
NB: this method implementation does two syscalls: `OpenDir` and `Sync`.

I considered calling `OpenDir` once, around the time when the directories are opened/created, and storing the handle in the struct. All the `Sync` calls could then reuse this handle, and do just a single syscall.

However, the handle holds a file descriptor. There is one sideloaded storage instance per `Range`, so correspondingly we may have too many descriptors open at once. Since descriptors are a limited resource, I decided against this approach.
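For reference, here is a minimal sketch of the open-then-sync pattern described above, on top of Pebble's `vfs.FS`. The standalone `syncDir` helper is illustrative only (the actual method lives on `DiskSideloadStorage` and its error handling may differ):

```go
package logstore

import "github.com/cockroachdb/pebble/vfs"

// syncDir is an illustrative helper showing the two syscalls mentioned above:
// open a handle to the directory, then fsync it so that the directory entries
// (the names of the files it contains) become durable.
func syncDir(fs vfs.FS, dir string) error {
	d, err := fs.OpenDir(dir) // syscall 1: obtain a directory handle
	if err != nil {
		return err
	}
	defer d.Close() // NB: the Close error is ignored in this sketch
	return d.Sync() // syscall 2: fsync the directory metadata
}
```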
Have we run any benchmarks to look at the performance impact?
A couple of things to address in the future: sideloaded files removal should happen strictly after a state machine sync; sideloaded files and directories should be cleaned up on startup because their removal is not always durable. Epic: none Release note: none
No. We already sync every 512 KiB when writing each sideloaded file. After this PR we will sync the parent directory of this file too, so that's K+1 syncs instead of K. This might have some impact, but not much.

This change doesn't feel optional though: it's a correctness/durability bug. We could optimize it later by having a better log storage (e.g. batch multiple entries into a single file since this is mostly append-only; but then would Pebble be able to ingest partial files with the current API?).
Yeah, merging and looking at roachperf nightlies is probably sufficient. But it would be good to have a microbenchmark too, if we don't already have one, to look at the worst-case impact.
Yeah, the directory fsyncs are also tiny and won't block on any other writes (unlike e.g. the WAL fsyncs).
Yeah, I agree that this isn't optional, but knowing the perf impact would give us some idea of what to expect from a backport. If the impact is severe (I don't expect it to be) then we may be more hesitant to backport this to a patch release. We could also consider simply spawning a separate goroutine to do the directory fsync concurrently with the SST fsync, which should ~eliminate the latency cost. Only if we need to though.
I'm inclined to think we should be fine with the existing benchmarks. Not sure microbenchmarking is the best framework for testing this kind of change. What kind of microbenchmarks do you mean? Typically the microbenchmarks I think of are in-memory, and geared towards measuring things like CPU and memory usage. In this case they wouldn't give us much: the cost of this PR is one extra directory fsync.

Maybe we could do a benchmark over a real on-disk storage, but then to make it "worst case" we need to put quite an effort into this test. We could do a simple thing like a single-range write-many-AddSSTs, but it wouldn't be stressing the system enough. Instead, we could just piggyback on an existing end-to-end benchmark.

Here are a few metrics before/after this change. Raft batch apply latency (`cockroach/pkg/kv/kvserver/metrics.go`, line 1305 in 933c4cd): ~18ms.
The "after" even won the benchmark, although I think this is just not a particularly stable metric.
This commit adds `TestSideloadStorageSync` which demonstrates that the sideloaded log storage can lose files and directories upon system crash. This is due to the fact that the whole directory hierarchy is not properly synced when the directories and files are created. A typical sideloaded storage file (entry 123 at term 44 for range r1234) looks like: `<data-dir>/auxiliary/sideloading/r1XXX/r1234/i123.t44`. Only existence of auxiliary directory is persisted upon its creation, by syncing the <data-dir> when Pebble initializes the store. All other directories (sideloading, r1XXX, r1234) are not persisted upon creation. Epic: none Release note: none
@erikgrinaker Could you please do another pass when you have a multi-minute? At this point I'm fairly confident in this PR.

I further split this PR into commits, so that the incremental "story" is more apparent (particularly see the evolution of fixing the TODOs).
Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 1 of 6 files at r3, 5 of 5 files at r4, 4 of 4 files at r6, 3 of 3 files at r7, 3 of 3 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cockroachdb, @erikgrinaker, @pavelkalinnikov, and @RaduBerinde)
-- commits
line 50 at r7:
partial
pkg/kv/kvserver/logstore/sideload.go
line 37 at r8 (raw file):
```go
	// term. Overwrites the file if it already exists.
	Put(_ context.Context, index kvpb.RaftIndex, term kvpb.RaftTerm, contents []byte) error

	// Sync syncs the underlying filesystem metadata so that all the preceding
```
Why expose `Sync` on the `SideloadStorage` interface instead of pushing the syncs into the `Put`/`Clear`/etc. methods, like we've done for the directory creation syncs?
pkg/kv/kvserver/logstore/sideload_disk.go
line 300 at r7 (raw file):
```go
// path are combined using the provided filesystem interface.
func mkdirAllAndSyncParents(fs vfs.FS, base string, dirs []string, perm os.FileMode) error {
	paths := append(make([]string, 0, len(dirs)+1), base)
```
We don't need to complicate the code if it doesn't feel worthwhile to you, but enumerating all sub-paths and constructing individual path strings to test feels a bit wasteful. I wonder whether iterative calls to `path.Split` over a precomputed `path` string would make this logic easier to reason about.
pkg/kv/kvserver/logstore/sideload_disk.go
line 317 at r7 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

This can be problematic if someone drops the `auxiliary` directory for fun (for example, they see it's empty and assume it is then safe to remove it). Maybe we should walk all the way up to the Pebble root directory instead, to be more robust.
Walking all the way up to the Pebble root directory and not assuming that "auxiliary" exists makes sense to me.
Or alternatively we could remove the distinction between the base directory path and the sub-directories we are trying to create and just create+sync whatever is necessary without making assumptions. This is analogous to what MkdirAll does.
pkg/kv/kvserver/logstore/sideload_test.go
line 594 at r6 (raw file):
```go
	const testEntry = "test-entry"
	require.NoError(t, ss.Put(ctx, 100 /* index */, 6 /* term */, []byte(testEntry)))

	// Cut off all syncs from this point, to emulate a crash.
```
Why is this needed? Does `eng.Close()` try to sync before closing?
Won't have time today unfortunately, will have a look next week.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @RaduBerinde)
pkg/kv/kvserver/logstore/sideload.go
line 37 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why expose `Sync` on the `SideloadStorage` interface instead of pushing the syncs into the `Put`/`Clear`/etc. methods, like we've done for the directory creation syncs?
This is to enable some batching. We can `Put` multiple entries, and then `Sync` the directory once to persist the addition of all the entries at once.
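To illustrate the batching this enables, here is a hedged sketch. The `putBatch` helper and its parameters are hypothetical; the `Put` and `Sync` signatures follow the interface quoted above:

```go
package logstore

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
)

// putBatch is a hypothetical example of the batching pattern: each Put writes
// and fsyncs its own entry file, and a single Sync at the end persists the
// directory entries referencing all the new files at once.
func putBatch(
	ctx context.Context, ss SideloadStorage, term kvpb.RaftTerm,
	first kvpb.RaftIndex, payloads [][]byte,
) error {
	for i, data := range payloads {
		if err := ss.Put(ctx, first+kvpb.RaftIndex(i), term, data); err != nil {
			return err
		}
	}
	// One directory sync instead of one per entry.
	return ss.Sync()
}
```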
pkg/kv/kvserver/logstore/sideload_disk.go
line 309 at r3 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
The other implementation does:
```
while path != "." {
	process(path)
	path = parent(path)
	// who guarantees that this will always return "." in the end?
	// also, all vfs.FS implementations must return this "." which is a lot to expect
	// on top of that, I ran some tests, and parent(path) sometimes did
	// not return ".", and had special effects depending on whether the path
	// is absolute or relative
}
```
The current implementation always terminates because the path is already decomposed and has a predefined number of elements.
Changed the code to walk directories up using the `fs.PathDir()` method.
pkg/kv/kvserver/logstore/sideload_disk.go
line 300 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We don't need to complicate the code if it doesn't feel worthwhile to you, but enumerating all sub-paths and constructing individual path strings to test feels a bit wasteful. I wonder whether iterative calls to `path.Split` over a precomputed `path` string would make this logic easier to reason about.
We can't use the `path` package because the path syntax may depend on the platform. We can't use the `path/filepath` package either because the path syntax can be specific to the `vfs.FS` implementation (e.g. `MemFS`), and not match the platform syntax. We thus have to use the `vfs.FS` helpers for manipulating paths. These are somewhat underspecified in how they handle absolute/relative paths and root or '.' directories, so in my initial implementation I chose to not use them.

Changed the code to use the `fs.PathDir()` helper though, PTAL.
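For illustration, the overall shape of the `fs.PathDir()` walk could look roughly like the sketch below. This is not the PR's exact code: `oserror.IsNotExist`, the termination conditions, and the `syncDir` helper (the sketch from earlier in this thread) are assumptions.

```go
package logstore

import (
	"os"

	"github.com/cockroachdb/errors/oserror"
	"github.com/cockroachdb/pebble/vfs"
)

// mkdirAllAndSyncParentsSketch is a hypothetical version of the
// directory-creation logic: find the missing ancestors by walking up with
// fs.PathDir, then create them top-down, syncing each parent so that every
// newly created directory entry is durable.
func mkdirAllAndSyncParentsSketch(fs vfs.FS, path string, perm os.FileMode) error {
	// Walk up until we hit an existing directory, the topmost dir, or the root.
	var missing []string
	for dir := path; dir != "."; {
		if _, err := fs.Stat(dir); err == nil {
			break // this ancestor already exists
		} else if !oserror.IsNotExist(err) {
			return err
		}
		missing = append(missing, dir)
		parent := fs.PathDir(dir)
		if parent == dir { // reached the root
			break
		}
		dir = parent
	}
	// Create the missing directories top-down. After creating each one, sync
	// its parent so that the reference to the new directory is durable.
	for i := len(missing) - 1; i >= 0; i-- {
		if err := fs.MkdirAll(missing[i], perm); err != nil {
			return err
		}
		if err := syncDir(fs, fs.PathDir(missing[i])); err != nil {
			return err
		}
	}
	return nil
}
```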
pkg/kv/kvserver/logstore/sideload_disk.go
line 317 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Walking all the way up to the Pebble root directory and not assuming that "auxiliary" exists makes sense to me.
Or alternatively we could remove the distinction between the base directory path and the sub-directories we are trying to create and just create+sync whatever is necessary without making assumptions. This is analogous to what MkdirAll does.
Done the latter.
pkg/kv/kvserver/logstore/sideload_test.go
line 594 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why is this needed? Does `eng.Close()` try to sync before closing?
The idea of this test is that we discard all the writes/syncs below, including the ones that `eng.Close()` might potentially do to gracefully terminate things. We are emulating a crash (see the comment below).

Correspondingly, calling `ss.Sync()` makes a difference to whether the entry persists or not (see the asserts below).
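For context, a minimal sketch of the crash-emulation technique, assuming Pebble's strict in-memory filesystem API (`vfs.NewStrictMem`, `SetIgnoreSyncs`, `ResetToSyncedState`); the actual test wiring differs:

```go
package logstore_test

import "github.com/cockroachdb/pebble/vfs"

// emulateCrash illustrates "cutting off" syncs and then discarding all
// unsynced state, which is how a test can simulate a power-off: only data
// that was explicitly synced before this point survives.
func emulateCrash(memFS *vfs.MemFS) {
	memFS.SetIgnoreSyncs(true) // any syncs below (e.g. from eng.Close) become no-ops
	// ... graceful teardown such as eng.Close() would happen here ...
	memFS.ResetToSyncedState() // drop everything not synced: the "crash"
	memFS.SetIgnoreSyncs(false)
}
```

A test can then reopen the storage on top of `memFS` and assert which files survived.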
@erikgrinaker @nvanbenschoten Addressed all your comments. PTAL.
Sorry for the hold-up, appreciate the patience.
```go
		}
		parent = fs.PathDir(dir)
		// NB: not checking against the separator, to be platform-agnostic.
		if dir == "." || parent == dir { // reached the topmost dir or the root
```
Instead of special-casing `.` (and what about `..`?), we could simply call `filepath.Abs()` on the path first, which will resolve relative paths and normalize out any `.` and `..` components.

Nevermind, I see your comment above that we have to rely on Pebble helpers.
I actually have no idea how these helpers behave with paths containing `..`, or in general with non-canonical paths. These are extremely brittle, and I think we should not use them. It would be cleaner to have a type-safe "canonical path" notion which is known to be relative to the data directory root, which exists.
I see. That probably means that we have similar challenges in Pebble itself. If this code is similar to what's already done in storage/Pebble that's probably good enough.
I think the assumptions are:
- Pebble root directory exists.
- All paths within this directory are well-formed / canonical suffixes of the pebble root path.
- We never walk above the pebble root, so we're effectively always iterating nice canonical substrings.
All these things could be made explicit and constrained within a type-safe helper, so that the space of things we should worry about is reduced. Posted some ideas internally.
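One hypothetical shape such a type-safe helper could take (the names and API below are entirely illustrative, not the ideas actually posted):

```go
package logstore

import "github.com/cockroachdb/pebble/vfs"

// rootedPath is a hypothetical type-safe path: it always denotes a descendant
// of the Pebble root, because it can only be extended by joining components
// onto the root one at a time.
type rootedPath struct {
	fs   vfs.FS
	path string
}

// rootPath wraps the Pebble root directory, which is assumed to exist.
func rootPath(fs vfs.FS, root string) rootedPath {
	return rootedPath{fs: fs, path: root}
}

// child appends one path component. The result stays canonical as long as the
// component is a single clean name (validation is omitted in this sketch).
func (p rootedPath) child(component string) rootedPath {
	return rootedPath{fs: p.fs, path: p.fs.PathJoin(p.path, component)}
}
```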
```go
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

	for _, tc := range []struct {
```
Worth adding some test cases here with relative paths as well as intermediate `.` and `..` components.
`MemFS` doesn't work well with those. For example, `a/b/c0/../c1` creates `a/b/c0/../c1` (literally), and after a restart only `a/b/c0` is persisted (possibly because the `MemFS.PathDir()` loop walks a different hierarchy than `MemFS.MkdirAll`). Similarly, `a/b/c0/./c1` creates `a/b/c0/./c1` (literally), and after a restart only `a/b/c0` is persisted.

Makes little sense for now ¯\_(ツ)_/¯, will take a look in the morning. Exactly why I didn't want to rely on the `PathDir()` helpers in the first place.
Added some relative paths and `..` things. Also added a TODO to make these types of things safer.
How do you feel about backporting this to 23.2, but leaving the older backports to bake for a while? That should give us a month of baking time for 23.2, which seems sufficient.
The sideloaded log storage does not sync the hierarchy of directories it creates. This can potentially lead to full or partial loss of its sub-directories in case of a system crash or power off. After this commit, every time sideloaded storage creates a directory, it syncs its parent so that the reference is durable. Epic: none Release note: none
The sideloaded storage fsyncs the files that it creates. Even though this guarantees durability of the files' content and metadata, it still does not guarantee that the references to these files are durable. For example, the Linux man page for fsync [^1] says:

```
Calling fsync() does not necessarily ensure that the entry in the directory
containing the file has also reached disk. For that an explicit fsync() on a
file descriptor for the directory is also needed.
```

It means that these files can be lost after a system crash or power off. This leads to issues:

1. The storage syncs are not atomic with the sideloaded file syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file is not yet durable. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft tries to load these entries to apply and/or send to other replicas.
2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and an unrecoverable loss of quorum.

This commit fixes the above issues by syncing the parent directory after writing sideloaded entry files. The natural point for this is MaybeSideloadEntries on the handleRaftReady path.

[^1]: https://man7.org/linux/man-pages/man2/fsync.2.html

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (AddSSTable) that is e.g. used by RESTORE. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a coordinated power-off/crash across multiple nodes this could lead to an unrecoverable loss of quorum.
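As a general illustration of the pattern this commit message describes (plain Go standard library, not CockroachDB code): fsync the file for its contents, then fsync the parent directory for the entry that names it.

```go
package main

import (
	"os"
	"path/filepath"
)

// writeDurably writes data to path so that, on POSIX systems, both the file
// contents and the directory entry referencing the file survive a crash.
func writeDurably(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // durability of the file's contents and metadata
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Per fsync(2), the directory containing the file must be fsynced too,
	// otherwise the new directory entry may not have reached disk.
	d, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}
```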
bors r=erikgrinaker
Build succeeded.
blathers backport 23.2.0-rc
blathers backport 23.1 22.2
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

- error creating merge commit from 4d1f243 to blathers/backport-release-23.1-114616: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1 failed. See errors above.
- error creating merge commit from 4d1f243 to blathers/backport-release-22.2-114616: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2 failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
blathers backport 23.1
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

- error creating merge commit from 4d1f243 to blathers/backport-release-23.1-114616: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1 failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
This PR ensures that the hierarchy of directories/files created by the sideloaded log storage is properly synced.
Previously, only the "leaf" files in this hierarchy were fsync-ed. Even though this guarantees that the files' content and metadata is synced, it still does not guarantee that the references to these files are durable. For example, the Linux man page for `fsync`[^1] says:

> Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

It means that these files can be lost after a system crash or power off. This leads to issues because:

1. Pebble WAL syncs are not atomic with the sideloaded file syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file is not yet durable. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft tries to load these entries to apply and/or send to other replicas.
2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and an unrecoverable loss of quorum.
This PR fixes the above issues, by syncing parents of all the directories and files that the sideloaded storage creates.
Part of #114411
Epic: none
Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (`AddSSTable`) that is e.g. used by `RESTORE`. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a correlated power-off/crash across multiple nodes this could lead to loss of quorum or data loss.

[^1]: https://man7.org/linux/man-pages/man2/fsync.2.html