
logstore: sync sideloaded storage directories #114616

Merged
merged 6 commits into cockroachdb:master from sideload-files-sync on Dec 6, 2023

Conversation


@pav-kv pav-kv commented Nov 16, 2023

This PR ensures that the hierarchy of directories/files created by the sideloaded log storage is properly synced.

Previously, only the "leaf" files in this hierarchy were fsync-ed. Even though this guarantees that the file contents and metadata are synced, it still does not guarantee that the references to these files are durable. For example, the Linux man page for fsync [1] says:

Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk.  For that an
explicit fsync() on a file descriptor for the directory is also
needed.

This means that these files can be lost after a system crash or power off. This leads to issues because:

  1. Pebble WAL syncs are not atomic with the sideloaded file syncs. It is thus possible that the raft log metadata "references" a sideloaded file and gets synced, but the file itself is not yet synced. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft tries to load these entries to apply and/or send to other replicas.

  2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and unrecoverable loss-of-quorum.

This PR fixes the above issues by syncing the parents of all directories and files that the sideloaded storage creates.
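For illustration, here is a minimal sketch of the "sync the file, then sync its parent directory" pattern, written against the standard library rather than CockroachDB's vfs.FS (the helper name is hypothetical; this is not the PR's code):

```go
import (
	"os"
	"path/filepath"
)

// writeFileDurable writes a file, fsyncs it, and then fsyncs the parent
// directory so that the directory entry referencing the file is durable too.
// Minimal sketch only; the PR operates through Pebble's vfs.FS instead of os.
func writeFileDurable(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // sync file contents and metadata
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// Without this step the file itself is durable, but the reference to it
	// (the directory entry) may still be lost on power-off.
	dir, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}
```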

Part of #114411

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (AddSSTable), used e.g. by RESTORE. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a correlated power-off/crash across multiple nodes, this could lead to loss of quorum or data loss.

Footnotes

  1. https://man7.org/linux/man-pages/man2/fsync.2.html

This commit removes the dirCreated field of DiskSideloadStorage, because
it is only used in tests and is redundant (the directory existence check
already does the job).

Epic: none
Release note: none

@pav-kv pav-kv requested review from erikgrinaker, a team and RaduBerinde November 16, 2023 22:54
@pav-kv pav-kv marked this pull request as ready for review November 16, 2023 23:01
@pav-kv pav-kv requested a review from a team November 16, 2023 23:01
@pav-kv pav-kv requested a review from a team as a code owner November 16, 2023 23:01

pav-kv commented Nov 16, 2023

Recommend reviewing commit-by-commit.

@pav-kv pav-kv changed the title logstore: sync sideloaded storage directory logstore: sync sideloaded storage directories Nov 17, 2023
return err
}
continue
}
}

// Sync implements SideloadStorage.
func (ss *DiskSideloadStorage) Sync() error {
@pav-kv pav-kv Nov 17, 2023


NB: this method implementation does two syscalls: OpenDir and Sync.

I considered calling OpenDir once, around the time when the directories are opened/created, and storing the handle in the struct. All the Sync calls could then reuse this handle, and do just a single syscall.

However, the handle holds a file descriptor. There is one sideloaded storage instance per Range, so correspondingly we may have too many descriptors open at once. Since the descriptors are a limited resource, I decided against this approach.
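For reference, a sketch of what such a two-syscall Sync boils down to, written against Pebble's vfs.FS (OpenDir + File.Sync); this illustrates the approach and is not the exact PR code:

```go
import "github.com/cockroachdb/pebble/vfs"

// syncDir opens the given directory and fsyncs it, which persists the
// directory entries (file/sub-directory creations and removals) under it.
// Two syscalls per call (open + fsync), plus the close.
func syncDir(fs vfs.FS, dir string) error {
	d, err := fs.OpenDir(dir)
	if err != nil {
		return err
	}
	if err := d.Sync(); err != nil {
		_ = d.Close()
		return err
	}
	return d.Close()
}
```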


@erikgrinaker erikgrinaker left a comment


Have we run any benchmarks to look at the performance impact?

Review threads on:
pkg/kv/kvserver/logstore/sideload.go
pkg/kv/kvserver/replica_destroy.go
pkg/kv/kvserver/replica_application_result.go
pkg/kv/kvserver/logstore/sideload_test.go
pkg/kv/kvserver/logstore/sideload_disk.go
A couple of things to address in the future: sideloaded files removal
should happen strictly after a state machine sync; sideloaded files and
directories should be cleaned up on startup because their removal is not
always durable.

Epic: none
Release note: none

pav-kv commented Nov 21, 2023

Have we run any benchmarks to look at the performance impact?

No. Some AddSST-heavy roachtests would do, like the restore ones?

We already sync every 512 KiB when writing each sideloaded file. After this PR we will also sync the parent directory of this file, so that's K+1 syncs instead of K. This might have some impact, but not much if the AddSSTs are multi-MiB in size. If this becomes a problem, one can double kv.bulk_sst.sync_size to 1 MiB.

This change doesn't feel optional though. It's a correctness/durability bug. We could optimize it later by having a better log storage (e.g. batch multiple entries into a single file since this is mostly append-only; but then would Pebble be able to ingest partial files with the current API?).

@erikgrinaker

Have we run any benchmarks to look at the performance impact?

No. Some of AddSST-heavy roachtest would do, like restore ones?

Yeah, merging and looking at roachperf nightlies is probably sufficient. But it would be good to have a microbenchmark too, if we don't already have one, to look at the worst-case impact.

We already sync every 512 KiB when writing each sideloaded file. After this PR we will sync the parent directory of this file too, so that's K+1 syncs instead of K. This might have some impact, but not much if the AddSSTs are multi-MiB in size. If this becomes a problem, one can double kv.bulk_sst.sync_size to 1MiB.

Yeah, the directory fsyncs are also tiny and won't block on any other writes (unlike e.g. the WAL fsyncs).

This change doesn't feel optional though. It's a correctness/durability bug. We could optimize it later by having a better log storage (e.g. batch multiple entries into a single file since this is mostly append-only; but then would Pebble be able to ingest partial files with the current API?).

Yeah, I agree that this isn't optional, but knowing the perf impact would give us some idea of what to expect from a backport. If the impact is severe (I don't expect it to be) then we may be more hesitant to backport this to a patch release.

We could also consider simply spawning a separate goroutine to do the directory fsync concurrently with the SST fsync, which should ~eliminate the latency cost. Only if we need to though.
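For what it's worth, a sketch of that suggestion (names are hypothetical; this was not implemented in the PR): run the two fsyncs concurrently so the directory sync stays off the critical path:

```go
import "golang.org/x/sync/errgroup"

// syncConcurrently runs the SST file fsync and the parent directory fsync in
// parallel and waits for both; the overall latency is the max of the two
// rather than their sum. Both arguments are hypothetical closures over the
// real sync work.
func syncConcurrently(syncSSTFile, syncParentDir func() error) error {
	var g errgroup.Group
	g.Go(syncSSTFile)
	g.Go(syncParentDir)
	return g.Wait()
}
```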


pav-kv commented Nov 29, 2023

Yeah, merging and looking at roachperf nightlies is probably sufficient. But i would be good to have a microbenchmark too, if we don't already have one, to look at the worst-case impact.

I'm inclined to think we should be fine with the existing benchmarks.

Not sure microbenchmarking is the best framework for testing this kind of change. What kind of microbenchmarks do you mean? Typically microbenchmarks I think of are in-memory, and geared towards measuring things like CPU and memory usage. In this case it wouldn't give us much: the cost of this PR is one extra OpenDir + Sync (in cases when we commit AddSST entries) which is negligible with the in-memory FS, the real impact is in IO.

Maybe we could do a benchmark over a real on-disk storage, but then to make it "worst case" we need to put quite an effort into this test. We could do a simple thing like a single-range write-many-AddSSTs, but it wouldn't be stressing the system enough. Instead, we could just piggyback on the restore tests which are closer to the real worst case?

Here are a few metrics before/after this change, on the restore/tpce/400GB/gce/nodes=4/cpus=8 test. The differences are within an error margin.

Storage WAL fsync: ~12ms (before/after screenshots)
Write bytes: ~120 M/s (before/after screenshots)
Raft batch apply latency ("raft.process.commandcommit.latency"): ~18ms (before/after screenshots)

The "after" even won the benchmark, although I think this is just not a particularly stable metric:

$ grep Throughput: *txt
after-2.txt:; Throughput: 88.786256 mb / node / second
after.txt:; Throughput: 83.704015 mb / node / second
before-2.txt:; Throughput: 79.064119 mb / node / second
before.txt:; Throughput: 62.075737 mb / node / second

This commit adds `TestSideloadStorageSync` which demonstrates that the
sideloaded log storage can lose files and directories upon system crash.
This is due to the fact that the whole directory hierarchy is not
properly synced when the directories and files are created.

A typical sideloaded storage file (entry 123 at term 44 for range r1234)
looks like: `<data-dir>/auxiliary/sideloading/r1XXX/r1234/i123.t44`.

Only the existence of the auxiliary directory is persisted upon its
creation, by syncing the <data-dir> when Pebble initializes the store.
All other directories (sideloading, r1XXX, r1234) are not persisted upon
creation.

Epic: none
Release note: none

pav-kv commented Nov 29, 2023

@erikgrinaker Could you please do another pass when you have a multi-minute? At this point I'm fairly confident in this PR, except maybe mkdirAllAndSyncParents has to walk higher than auxiliary (see the comment inline).

I further split this PR into commits, so that the incremental "story" is more apparent (particularly see the evolution of fixing TODOs in TestSideloadStorageSync in the last 3 commits).


@nvanbenschoten nvanbenschoten left a comment


Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 1 of 6 files at r3, 5 of 5 files at r4, 4 of 4 files at r6, 3 of 3 files at r7, 3 of 3 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cockroachdb, @erikgrinaker, @pavelkalinnikov, and @RaduBerinde)


-- commits line 50 at r7:
partial


pkg/kv/kvserver/logstore/sideload.go line 37 at r8 (raw file):

	// term. Overwrites the file if it already exists.
	Put(_ context.Context, index kvpb.RaftIndex, term kvpb.RaftTerm, contents []byte) error
	// Sync syncs the underlying filesystem metadata so that all the preceding

Why expose Sync on the SideloadStorage interface instead of pushing the syncs into the Put/Clear/etc. methods, like we've done for the directory creation syncs?


pkg/kv/kvserver/logstore/sideload_disk.go line 300 at r7 (raw file):

// path are combined using the provided filesystem interface.
func mkdirAllAndSyncParents(fs vfs.FS, base string, dirs []string, perm os.FileMode) error {
	paths := append(make([]string, 0, len(dirs)+1), base)

We don't need to complicate the code if it doesn't feel worthwhile to you, but enumerating all sub-paths and constructing individual path strings to test feels a bit wasteful. I wonder whether iterative calls to path.Split over a precomputed path string would make this logic easier to reason about.


pkg/kv/kvserver/logstore/sideload_disk.go line 317 at r7 (raw file):

Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

This can be problematic if someone drops the auxiliary directory for fun (for example, they see it's empty and assume it is then safe to remove it).

Maybe we should walk all the way up to the Pebble root directory instead, to be more robust.

Walking all the way up to the Pebble root directory and not assuming that "auxiliary" exists makes sense to me.

Or alternatively we could remove the distinction between the base directory path and the sub-directories we are trying to create and just create+sync whatever is necessary without making assumptions. This is analogous to what MkdirAll does.


pkg/kv/kvserver/logstore/sideload_test.go line 594 at r6 (raw file):

	const testEntry = "test-entry"
	require.NoError(t, ss.Put(ctx, 100 /* index */, 6 /* term */, []byte(testEntry)))
	// Cut off all syncs from this point, to emulate a crash.

Why is this needed? Does eng.Close() try to sync before closing?

@erikgrinaker

Could you please do another pass when you have a multi-minute?

Won't have time today unfortunately, will have a look next week.


@pav-kv pav-kv left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @RaduBerinde)


pkg/kv/kvserver/logstore/sideload.go line 37 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why expose Sync on the SideloadStorage interface instead of pushing the syncs into the Put/Clear/etc. methods, like we've done for the directory creation syncs?

This is to enable some batching. We can Put multiple entries, and then Sync the directory once to persist the addition of all the entries at once.
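To illustrate the batching argument, a sketch against the SideloadStorage signatures quoted above (assumes package logstore; the caller-side helper and the entry struct are hypothetical):

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
)

// entryPayload is a hypothetical description of one sideloaded entry to write.
type entryPayload struct {
	index kvpb.RaftIndex
	term  kvpb.RaftTerm
	data  []byte
}

// putEntriesDurably writes several sideloaded entries and then syncs the
// directory metadata once, instead of syncing after every Put.
func putEntriesDurably(ctx context.Context, ss SideloadStorage, entries []entryPayload) error {
	for _, e := range entries {
		if err := ss.Put(ctx, e.index, e.term, e.data); err != nil {
			return err
		}
	}
	// One Sync persists the addition of all the entries written above.
	return ss.Sync()
}
```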


pkg/kv/kvserver/logstore/sideload_disk.go line 309 at r3 (raw file):

Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

The other implementation does:

for path != "." {
	process(path)
	path = parent(path) // who guarantees that this will always return "." in the end?
	// also, all vfs.FS implementations must return this "." which is a lot to expect

	// on top of that, I ran some tests, and parent(path) sometimes did
	// not return ".", and had special effects depending on whether the path
	// is absolute or relative
}

The current implementation always terminates because the path is already decomposed and has a predefined number of elements.

Changed the code to walk directories up using the fs.PathDir() method.


pkg/kv/kvserver/logstore/sideload_disk.go line 300 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We don't need to complicate the code if it doesn't feel worthwhile to you, but enumerating all sub-paths and constructing individual path strings to test feels a bit wasteful. I wonder whether iterative calls to path.Split over a precomputed path string would make this logic easier to reason about.

We can't use the path package because the path syntax may depend on the platform. We can't use the path/filepath package either, because the path syntax can be specific to the vfs.FS implementation (e.g. MemFS) and not match the platform syntax. We thus have to use the vfs.FS helpers for manipulating paths. These are somewhat underspecified in how they handle absolute/relative paths and root or '.' directories, so in my initial implementation I chose not to use them.

Changed the code to use the fs.PathDir() helper though, PTAL.
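A rough sketch of the fs.PathDir()-based walk (simplified: it syncs every ancestor up to the topmost directory, whereas it suffices to sync the parents of directories that were actually created; this is not the exact mkdirAllAndSyncParents code):

```go
import (
	"os"

	"github.com/cockroachdb/pebble/vfs"
)

// mkdirAllAndSyncParentsSketch creates dir (and any missing parents) and then
// walks upwards with fs.PathDir, syncing each level so that the creation of
// its child directory entry is durable.
func mkdirAllAndSyncParentsSketch(fs vfs.FS, dir string, perm os.FileMode) error {
	if err := fs.MkdirAll(dir, perm); err != nil {
		return err
	}
	for d := dir; ; {
		f, err := fs.OpenDir(d)
		if err != nil {
			return err
		}
		err = f.Sync()
		if cerr := f.Close(); err == nil {
			err = cerr
		}
		if err != nil {
			return err
		}
		parent := fs.PathDir(d)
		if d == "." || parent == d { // reached the topmost dir or the root
			return nil
		}
		d = parent
	}
}
```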


pkg/kv/kvserver/logstore/sideload_disk.go line 317 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Walking all the way up to the Pebble root directory and not assuming that "auxiliary" exists makes sense to me.

Or alternatively we could remove the distinction between the base directory path and the sub-directories we are trying to create and just create+sync whatever is necessary without making assumptions. This is analogous to what MkdirAll does.

Done the latter.


pkg/kv/kvserver/logstore/sideload_test.go line 594 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why is this needed? Does eng.Close() try to sync before closing?

The idea of this test is that we discard all the writes/syncs below, including the ones that eng.Close() might potentially do to gracefully terminate things. We are emulating a crash (see the comment below).

Correspondingly, calling ss.Sync() makes a difference to whether the entry persists or not (see the asserts below).

@pav-kv pav-kv force-pushed the sideload-files-sync branch 2 times, most recently from 1f1693c to ca52afa Compare December 4, 2023 13:14

pav-kv commented Dec 4, 2023

@erikgrinaker @nvanbenschoten Addressed all your comments. PTAL.


@erikgrinaker erikgrinaker left a comment


Sorry for the hold-up, appreciate the patience.

}
parent = fs.PathDir(dir)
// NB: not checking against the separator, to be platform-agnostic.
if dir == "." || parent == dir { // reached the topmost dir or the root
Contributor:

Instead of special-casing . (and what about ..?), we could simply call filepath.Abs() on the path first, which will resolve relative paths and normalize out any . and .. components.

Nevermind, I see your comment above that we have to rely on Pebble helpers.

Collaborator Author (pav-kv):

I actually have no idea how these helpers behave with paths containing .., or in general with non-canonical paths. These are extremely brittle, and I think we should not use them. It would be cleaner to have a type-safe "canonical path" notion that is known to be relative to the data directory root, which is known to exist.

Contributor:

I see. That probably means that we have similar challenges in Pebble itself. If this code is similar to what's already done in storage/Pebble that's probably good enough.

@pav-kv pav-kv Dec 6, 2023


I think the assumptions are:

  • Pebble root directory exists.
  • All paths within this directory are well-formed / canonical suffixes of the pebble root path.
  • We never walk above the pebble root, so we're effectively always iterating nice canonical substrings.

All these things could be made explicit and constrained within a type-safe helper, so that the space of things we should worry about is reduced. Posted some ideas internally.

defer leaktest.AfterTest(t)()
defer log.Scope(t).Close(t)

for _, tc := range []struct {
Contributor:

Worth adding some test cases here with relative paths as well as intermediate . and .. components.

@pav-kv pav-kv Dec 5, 2023


MemFS doesn't work well with those. For example, a/b/c0/../c1 creates a/b/c0/../c1 (literally), and after a restart only a/b/c0 is persisted (possibly because the MemFS.PathDir() loop walks a different hierarchy than MemFS.MkdirAll).

Similarly, a/b/c0/./c1 creates a/b/c0/./c1 (literally), and after a restart only a/b/c0 is persisted.

Makes little sense for now ¯\_(ツ)_/¯, will take a look in the morning. Exactly why I didn't want to rely on PathDir() helpers in the first place.

Collaborator Author (pav-kv):

Added some relative paths and .. things. Also added a TODO to make these types of things safer.

@erikgrinaker

How do you feel about backporting this to 23.2, but leaving the older backports to bake for a while? That should give us a month of baking time for 23.2, which seems sufficient.

The sideloaded log storage does not sync the hierarchy of directories it
creates. This can potentially lead to full or partial loss of its
sub-directories in case of a system crash or power off.

After this commit, every time sideloaded storage creates a directory, it
syncs its parent so that the reference is durable.

Epic: none
Release note: none
The sideloaded storage fsyncs the files that it creates. Even though
this guarantees durability of the file contents and metadata, it still
does not guarantee that the references to these files are durable. For
example, the Linux man page for fsync [^1] says:

```
Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk.  For that an
explicit fsync() on a file descriptor for the directory is also
needed.
```

It means that these files can be lost after a system crash or power off.
This leads to issues:

1. The storage syncs are not atomic with the sideloaded file syncs. It
   is thus possible that the raft log metadata "references" a sideloaded
   file and gets synced, but the file itself is not yet synced. A power
   off/on at this point leads to an internal inconsistency, and can
   result in a crash loop when raft tries to load these entries to apply
   and/or send to other replicas.

2. The durability of entry files is used as a pre-condition to sending
   raft messages that trigger committing these entries. A coordinated
   power off/on on a majority of replicas can thus lead to losing
   committed entries and unrecoverable loss-of-quorum.

This commit fixes the above issues, by syncing the parent directory
after writing sideloaded entry files. The natural point for this is
MaybeSideloadEntries on the handleRaftReady path.

[^1]: https://man7.org/linux/man-pages/man2/fsync.2.html

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log
storage, caused by incorrect syncing of filesystem metadata. It was
possible to lose writes of a particular kind (AddSSTable) that is e.g.
used by RESTORE. This loss was possible only under power-off or OS crash
conditions. As a result, CRDB could enter a crash loop on restart. In
the worst case of a coordinated power-off/crash across multiple nodes
this could lead to an unrecoverable loss of quorum.
@pav-kv pav-kv added the backport-23.2.x Flags PRs that need to be backported to 23.2. label Dec 6, 2023

pav-kv commented Dec 6, 2023

bors r=erikgrinaker


craig bot commented Dec 6, 2023

Build succeeded:

@craig craig bot merged commit 358909d into cockroachdb:master Dec 6, 2023
8 of 9 checks passed
@pav-kv pav-kv deleted the sideload-files-sync branch December 6, 2023 18:02

pav-kv commented Dec 8, 2023

blathers backport 23.2.0-rc


pav-kv commented Dec 8, 2023

blathers backport 23.1 22.2


blathers-crl bot commented Dec 8, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 4d1f243 to blathers/backport-release-23.1-114616: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.


error creating merge commit from 4d1f243 to blathers/backport-release-22.2-114616: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.


pav-kv commented Jan 3, 2024

blathers backport 23.1


blathers-crl bot commented Jan 3, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 4d1f243 to blathers/backport-release-23.1-114616: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
