Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use fcntl(F_FULLFSYNC) on OS X #9356

Closed
wants to merge 4 commits into from

Conversation

riversand963
Copy link
Contributor

@riversand963 riversand963 commented Jan 4, 2022

Closing #5954

fsync/fdatasync on Linux:

(fsync/fdatasync) includes writing through or flushing a disk cache if present.

However, on OS X and iOS:

(fsync) will flush all data from the host to the drive (i.e. the "permanent storage device"),
the drive itself may not physically write the data to the platters for quite some time and it
may be written in an out-of-order sequence.

Solution is to use fcntl(F_FULLFSYNC) on OS X so that we get the same
persistence guarantee.

According to OSX man page,

The F_FULLFSYNC fcntl asks the drive to flush **all** buffered data to permanent storage.

This suggests that it will be no faster than fsync on Linux, since Linux, according to its man page,

writing through or flushing a disk cache if present

It means Linux may not flush all data from disk cache.

This is similar to bug reports/fixes in:

@riversand963 riversand963 linked an issue Jan 4, 2022 that may be closed by this pull request
@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@jay-zhuang jay-zhuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Do we know the performance impact of this change? Maybe worth to mention that in the HISTORY.md.

@riversand963
Copy link
Contributor Author

Thanks @jay-zhuang for the review.
I do not have the numbers about performance overhead. If there is, it's probably something we need to live with for OSX.

@riversand963
Copy link
Contributor Author

@jay-zhuang updated PR description to mention potential performance impact.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

fsync/fdatasync on Linux:
```
(fsync/fdatasync) includes writing through or flushing a disk cache if present.
```

However, on OS X and iOS:
```
(fsync) will flush all data from the host to the drive (i.e. the "permanent storage device"),
the drive itself may not physically write the data to the platters for quite some time and it
may be written in an out-of-order sequence.
```

Solution is to use `fcntl(F_FULLFSYNC)` on OS X so that we get the same
persistence guarantee.

This is similar to bug reports/fixes in:

- golan: golang/go#26650

- leveldb:
  google/leveldb@296de8d.
  Not sure if we should fallback to fsync since we break persistence contract.
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963 riversand963 deleted the osx-fsync branch January 19, 2022 04:26
@mrambacher
Copy link
Contributor

This change has a huge performance impact. Running tests on MacOs now takes significantly more time.

With this change:
[----------] 2 tests from DBIteratorTestInstance/DBIteratorTest
[ RUN ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0
[ OK ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0 (20533 ms)
[ RUN ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/1
[ OK ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/1 (20606 ms)
[----------] 2 tests from DBIteratorTestInstance/DBIteratorTest (41139 ms total)

Prior to this change:
tests from DBIteratorTestInstance/DBIteratorTest
[ RUN ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0
[ OK ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0 (913 ms)
[ RUN ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/1
[ OK ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/1 (910 ms)
[----------] 2 tests from DBIteratorTestInstance/DBIteratorTest (1823 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (1823 ms total)
[ PASSED ] 2 tests.

This is just one example test of many that are significantly slower.

@ajkr
Copy link
Contributor

ajkr commented Jan 19, 2022

We can disable fsync in tests:

rocksdb/db/db_test_util.h

Lines 916 to 919 in 875bfd7

// `env_do_fsync` decides whether the special Env would do real
// fsync for files and directories. Skipping fsync can speed up
// tests, but won't cover the exact fsync logic.
DBTestBase(const std::string path, bool env_do_fsync);
.

A non-comprehensive search found it set to true in 16/69 places, but it seems unlikely there's really 53+ tests that need to fsync.

@riversand963
Copy link
Contributor Author

riversand963 commented Jan 19, 2022

To be fair, the above unit test (DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/*) amplifies the performance overhead caused by F_FULLFSYNC by calling fcntl(F_FULLFSYNC) 1684 times in total.
I applied the diff

diff --git a/env/io_posix.cc b/env/io_posix.cc
index 3e519dce4..6f170f9e6 100644
--- a/env/io_posix.cc
+++ b/env/io_posix.cc
@@ -1109,6 +1109,7 @@ IOStatus PosixMmapFile::Flush(const IOOptions& /*opts*/,
 IOStatus PosixMmapFile::Sync(const IOOptions& /*opts*/,
                              IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLSYNC) mmapped file", filename_, errno);
   }
@@ -1127,6 +1128,7 @@ IOStatus PosixMmapFile::Sync(const IOOptions& /*opts*/,
 IOStatus PosixMmapFile::Fsync(const IOOptions& /*opts*/,
                               IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("While fcntl(F_FULLSYNC) on mmaped file", filename_, errno);
   }
@@ -1333,6 +1335,7 @@ IOStatus PosixWritableFile::Flush(const IOOptions& /*opts*/,
 IOStatus PosixWritableFile::Sync(const IOOptions& /*opts*/,
                                  IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLFSYNC)", filename_, errno);
   }
@@ -1347,6 +1350,7 @@ IOStatus PosixWritableFile::Sync(const IOOptions& /*opts*/,
 IOStatus PosixWritableFile::Fsync(const IOOptions& /*opts*/,
                                   IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLFSYNC)", filename_, errno);
   }
@@ -1528,6 +1532,7 @@ IOStatus PosixRandomRWFile::Flush(const IOOptions& /*opts*/,
 IOStatus PosixRandomRWFile::Sync(const IOOptions& /*opts*/,
                                  IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLFSYNC) random rw file", filename_, errno);
   }
@@ -1542,6 +1547,7 @@ IOStatus PosixRandomRWFile::Sync(const IOOptions& /*opts*/,
 IOStatus PosixRandomRWFile::Fsync(const IOOptions& /*opts*/,
                                   IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("While fcntl(F_FULLSYNC) random rw file", filename_, errno);
   }
@@ -1625,6 +1631,7 @@ IOStatus PosixDirectory::FsyncWithDirOptions(
   // btrfs is a Linux file system, while currently F_FULLFSYNC is available on
   // Mac OS.
   assert(!is_btrfs_);
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLFSYNC)", "a directory", errno);
   }

@mrambacher
Copy link
Contributor

Is there anything that can be done to make this better? A make -j check is taking an incredibly long time on the Mac now. Many of the tests take an extra 5-10 seconds per test case. Testing was already slower on the Mac than Linux but this has made it unbearably worse. To run the table_test suite on my machine takes 15x as long as it did prior to this change. Other tests (as mentioned earlier) also are much slower.

I have spent no time trying to determine why the code is doing the sync, but can see that this change is what is causing the issues.

@riversand963
Copy link
Contributor Author

I can think of a few things.

  1. Run the tests on memory-backed device (https://stackoverflow.com/questions/2033362/does-os-x-have-an-equivalent-to-dev-shm)
  2. Add an environment variable, ENV_DO_FSYNC, which can be picked up by the unit tests when creating the Env obj. Note that this may cause some tests to fail (I haven't tried so don't know).
  3. If there is already a PR, then you can rely on CIs. In addition, build and run locally the individual tests that fail. This is what I do when I debug Windows-related issues on an Intel NUC.

Should 2 be considered, I would also suggest that #9356 (comment) be investigated.

@mrambacher
Copy link
Contributor

The CircleCI tests are also failing sporadically on the Mac with "deadline exceeded". I submitted a PR that is not used for most of the tests (only a new test) and it still fails, so the CI environment cannot be relied upon.

I was going to suggest a compile time variable such as ROCKSDB_DISABLE_FULL_FSYNC, similar to what is done with ROCKSDB_DISABLE_SYNC_IN_RANGE and ROCKSDB_DISABLE_FALLOCATE. Is there a reason not to implement such a solution?

If I have time, I will attempt to figure out how many calls (and from where) to fsync we are making for these tests, and how long each type of call is taking. At the moment, this change is making testing on the Mac very painful and is killing performance. While that might be necessary for data integrity, it should be something that the user opts into, not that is always on.

@riversand963
Copy link
Contributor Author

The CircleCI tests are also failing sporadically on the Mac with "deadline exceeded"

This has been happening even before this commit. At present, there has not been history long enough to demonstrate this commit is a problem for CI. We will keep an eye on this.

I was going to suggest a compile time variable such as ROCKSDB_DISABLE_FULL_FSYNC

So far, there has been no evidence of this commit hurting the overall performance of applications that use rocksdb library. I would be surprised if so because we keep fsync outside dbmutex and often in background, and RocksDB doesn't fsync very often in real workloads. Some of the unit tests become slower because they sync way more frequent. If this is a problem, it should be resolved in unit test instead adding a flag leading to different library binaries. I am fine with either DBTestBase(..., env_do_fsync=false) or using an environment variable similar to KEEP_DB, but not a compile time flag.

@mrambacher
Copy link
Contributor

The CircleCI tests are also failing sporadically on the Mac with "deadline exceeded"

This has been happening even before this commit. At present, there has not been history long enough to demonstrate this commit is a problem for CI. We will keep an eye on this.

I was going to suggest a compile time variable such as ROCKSDB_DISABLE_FULL_FSYNC

So far, there has been no evidence of this commit hurting the overall performance of applications that use rocksdb library. I would be surprised if so because we keep fsync outside dbmutex and often in background, and RocksDB doesn't fsync very often in real workloads. Some of the unit tests become slower because they sync way more frequent. If this is a problem, it should be resolved in unit test instead adding a flag leading to different library binaries. I am fine with either DBTestBase(..., env_do_fsync=false) or using an environment variable similar to KEEP_DB, but not a compile time flag.

The TableTest/ParameterizedHarnessTest.RandomizedHarnessTest suite takes 15x what it used to on my Mac. It does not call sync/fsync at all in the test code and uses the default Env. One test on my machine that now takes 14000+ ms performed 305 syncs + 305 fsyncs. There are no calls to sync in the test itself -- it is writing data and doing compaction.

I agree that this test probably does not need to do fsync but it would need to be changed to use a "NoSyncFileSystem". Additionally, we would need to track down all of the other tests that are much slower because of this and change them as well.

And if anyone is in production on a Mac, I would guess they will not be pleasantly surprised at how long these operations now take.

@riversand963
Copy link
Contributor Author

One test on my machine that now takes 14000+ ms performed 305 syncs + 305 fsyncs

This PR does not change how many Sync() or FSync() are called. It just replaces the problematic fsync() with fcntl() for each FSync() and Sync() on OS X. If there are too many such calls, then this is a problem of the unit test. IMO, we should fix unit tests for libraries instead of fixing libraries for unit tests.

TableTest/ParameterizedHarnessTest.RandomizedHarnessTest

This is a parameterized tests. For each test case, it will create new db at least once involving syncing dir and file sync. Check GenerateArgList and see how many different combinations are generated.

FWIW, there are several things to consider

  • Run the tests on OS-X's equivalent of /dev/shm
  • Add an environment variable, ENV_DO_FSYNC, which can be picked up by the unit tests when creating the Env obj. Note that this may cause some tests to fail. This approach is similar to what we do with KEEP_DB env var. Also investigate what tests actually need the sync/fsync.
  • If there is already a PR, then CIs can always be used.

@riversand963
Copy link
Contributor Author

riversand963 commented Jan 21, 2022

You can specify TEST_TMPDIR when running unit tests.

@mrambacher
Copy link
Contributor

Not all tests (table_test for example) go through SpecialEnv so using that "flag" will not work.

Furthermore, table_test does not do anything special -- build a table, compaction, gets and seeks. It does not ever explicitly call sync. Each of the table_test ParameterizedHarnessTest that use a DB used to take less than 1 second and now take well over 4+ seconds.

Some investigation shows that the test with args "reverse=0, restart=16 compression=kBZip2Compression threads=4, format=2 mmap=0" is doing:
Num files opened: 976
Num files deleted: 543
Num files renamed: 304
Num Flush(): 5430
Num Sync(): 305
Num Dsync(): 244
Num Fsync(): 61
Num Close(): 549
Num Read(): 366
Num Append(): 5003
Num bytes read: 402234
Num bytes written: 553120

Each of these "sync" operations appears to be taking around an additional 23 ms (I have disabled them all individually and it does not appear any one class of operation is taking longer than another).

Perhaps the 23 ms per operation is a reasonable "tax" to pay. But the cumulative total for this test is over 14+ seconds.

I will code up a FileSystem that can be taught to skip these various operations that we can use for testing. But perhaps someone should investigate why we are doing so many of them and if it is really necessary? These are fairly simple tests of a small database. What will happen in a "real world" scenario?

@riversand963
Copy link
Contributor Author

Can you try the following? It can easily be wrapped in a script to support running make -j check.
Details can be found at https://gist.github.com/rxin/5085564.

I actually tried the following, the test execution time went down to 729ms for OS X.

$ hdiutil attach -nomount ram://524288
/dev/disk2
$ diskutil erasevolume HFS+ "ramdisk" /dev/disk2
$ TEST_TMPDIR=/Volumes/ramdisk ./db_iterator_test --gtest_filter='DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0'

@riversand963
Copy link
Contributor Author

I will code up a FileSystem that can be taught to skip these various operations that we can use for testing

With the approach of using a Ramdisk on Mac, this adds little value while introducing unnecessary code.

someone should investigate why we are doing so many of them and if it is really necessary

This test simple, but it still requires NewDB(), each of which calls SyncManifest(). See https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl_open.cc#L332. The test code in this TEST_P or TEST_F are just the surface, we need to look deep into what actually happens.

These are fairly simple tests of a small database. What will happen in a "real world" scenario?

I doubt whether there is real world scenario repeatedly calling NewDB() 1000+ times within 1 min on a single OS X machine. So far, all your comments are based on unit tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fsync on macos should use fcntl (fd, F_FULLFSYNC)
5 participants