Use fcntl(F_FULLFSYNC) on OS X #9356

riversand963 · 2022-01-04T18:12:48Z

Closing #5954

fsync/fdatasync on Linux:

(fsync/fdatasync) includes writing through or flushing a disk cache if present.

However, on OS X and iOS:

(fsync) will flush all data from the host to the drive (i.e. the "permanent storage device"),
the drive itself may not physically write the data to the platters for quite some time and it
may be written in an out-of-order sequence.

Solution is to use fcntl(F_FULLFSYNC) on OS X so that we get the same
persistence guarantee.

According to OSX man page,

The F_FULLFSYNC fcntl asks the drive to flush **all** buffered data to permanent storage.

This suggests that it will be no faster than fsync on Linux, since Linux, according to its man page,

writing through or flushing a disk cache if present

It means Linux may not flush all data from disk cache.

This is similar to bug reports/fixes in:

golang: os: File.Sync() for Darwin should use the F_FULLFSYNC fcntl golang/go#26650
leveldb: google/leveldb@296de8d.
Not sure if we should fallback to fsync since we break persistence contract.

facebook-github-bot · 2022-01-05T00:57:51Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

jay-zhuang

LGTM. Do we know the performance impact of this change? Maybe worth to mention that in the HISTORY.md.

riversand963 · 2022-01-18T23:05:31Z

Thanks @jay-zhuang for the review.
I do not have the numbers about performance overhead. If there is, it's probably something we need to live with for OSX.

riversand963 · 2022-01-18T23:42:08Z

@jay-zhuang updated PR description to mention potential performance impact.

facebook-github-bot · 2022-01-18T23:44:34Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-01-18T23:46:14Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

fsync/fdatasync on Linux: ``` (fsync/fdatasync) includes writing through or flushing a disk cache if present. ``` However, on OS X and iOS: ``` (fsync) will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence. ``` Solution is to use `fcntl(F_FULLFSYNC)` on OS X so that we get the same persistence guarantee. This is similar to bug reports/fixes in: - golan: golang/go#26650 - leveldb: google/leveldb@296de8d. Not sure if we should fallback to fsync since we break persistence contract.

facebook-github-bot · 2022-01-19T01:38:52Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-01-19T01:39:55Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-01-19T01:41:14Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mrambacher · 2022-01-19T22:56:24Z

This change has a huge performance impact. Running tests on MacOs now takes significantly more time.

With this change:
[----------] 2 tests from DBIteratorTestInstance/DBIteratorTest
[ RUN ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0
[ OK ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0 (20533 ms)
[ RUN ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/1
[ OK ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/1 (20606 ms)
[----------] 2 tests from DBIteratorTestInstance/DBIteratorTest (41139 ms total)

Prior to this change:
tests from DBIteratorTestInstance/DBIteratorTest
[ RUN ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0
[ OK ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0 (913 ms)
[ RUN ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/1
[ OK ] DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/1 (910 ms)
[----------] 2 tests from DBIteratorTestInstance/DBIteratorTest (1823 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (1823 ms total)
[ PASSED ] 2 tests.

This is just one example test of many that are significantly slower.

ajkr · 2022-01-19T23:50:57Z

We can disable fsync in tests:

rocksdb/db/db_test_util.h

Lines 916 to 919 in 875bfd7

    
           // `env_do_fsync` decides whether the special Env would do real 
        
           // fsync for files and directories. Skipping fsync can speed up 
        
           // tests, but won't cover the exact fsync logic. 
        
           DBTestBase(const std::string path, bool env_do_fsync);

.

A non-comprehensive search found it set to true in 16/69 places, but it seems unlikely there's really 53+ tests that need to fsync.

riversand963 · 2022-01-19T23:54:57Z

To be fair, the above unit test (DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/*) amplifies the performance overhead caused by F_FULLFSYNC by calling fcntl(F_FULLFSYNC) 1684 times in total.
I applied the diff

diff --git a/env/io_posix.cc b/env/io_posix.cc
index 3e519dce4..6f170f9e6 100644
--- a/env/io_posix.cc
+++ b/env/io_posix.cc
@@ -1109,6 +1109,7 @@ IOStatus PosixMmapFile::Flush(const IOOptions& /*opts*/,
 IOStatus PosixMmapFile::Sync(const IOOptions& /*opts*/,
                              IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLSYNC) mmapped file", filename_, errno);
   }
@@ -1127,6 +1128,7 @@ IOStatus PosixMmapFile::Sync(const IOOptions& /*opts*/,
 IOStatus PosixMmapFile::Fsync(const IOOptions& /*opts*/,
                               IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("While fcntl(F_FULLSYNC) on mmaped file", filename_, errno);
   }
@@ -1333,6 +1335,7 @@ IOStatus PosixWritableFile::Flush(const IOOptions& /*opts*/,
 IOStatus PosixWritableFile::Sync(const IOOptions& /*opts*/,
                                  IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLFSYNC)", filename_, errno);
   }
@@ -1347,6 +1350,7 @@ IOStatus PosixWritableFile::Sync(const IOOptions& /*opts*/,
 IOStatus PosixWritableFile::Fsync(const IOOptions& /*opts*/,
                                   IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLFSYNC)", filename_, errno);
   }
@@ -1528,6 +1532,7 @@ IOStatus PosixRandomRWFile::Flush(const IOOptions& /*opts*/,
 IOStatus PosixRandomRWFile::Sync(const IOOptions& /*opts*/,
                                  IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLFSYNC) random rw file", filename_, errno);
   }
@@ -1542,6 +1547,7 @@ IOStatus PosixRandomRWFile::Sync(const IOOptions& /*opts*/,
 IOStatus PosixRandomRWFile::Fsync(const IOOptions& /*opts*/,
                                   IODebugContext* /*dbg*/) {
 #ifdef HAVE_FULLFSYNC
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("While fcntl(F_FULLSYNC) random rw file", filename_, errno);
   }
@@ -1625,6 +1631,7 @@ IOStatus PosixDirectory::FsyncWithDirOptions(
   // btrfs is a Linux file system, while currently F_FULLFSYNC is available on
   // Mac OS.
   assert(!is_btrfs_);
+  fprintf(stdout, "fsync\n");
   if (::fcntl(fd_, F_FULLFSYNC) < 0) {
     return IOError("while fcntl(F_FULLFSYNC)", "a directory", errno);
   }

mrambacher · 2022-01-21T02:08:13Z

Is there anything that can be done to make this better? A make -j check is taking an incredibly long time on the Mac now. Many of the tests take an extra 5-10 seconds per test case. Testing was already slower on the Mac than Linux but this has made it unbearably worse. To run the table_test suite on my machine takes 15x as long as it did prior to this change. Other tests (as mentioned earlier) also are much slower.

I have spent no time trying to determine why the code is doing the sync, but can see that this change is what is causing the issues.

riversand963 · 2022-01-21T04:33:43Z

I can think of a few things.

Run the tests on memory-backed device (https://stackoverflow.com/questions/2033362/does-os-x-have-an-equivalent-to-dev-shm)
Add an environment variable, ENV_DO_FSYNC, which can be picked up by the unit tests when creating the Env obj. Note that this may cause some tests to fail (I haven't tried so don't know).
If there is already a PR, then you can rely on CIs. In addition, build and run locally the individual tests that fail. This is what I do when I debug Windows-related issues on an Intel NUC.

Should 2 be considered, I would also suggest that #9356 (comment) be investigated.

mrambacher · 2022-01-21T13:51:03Z

The CircleCI tests are also failing sporadically on the Mac with "deadline exceeded". I submitted a PR that is not used for most of the tests (only a new test) and it still fails, so the CI environment cannot be relied upon.

I was going to suggest a compile time variable such as ROCKSDB_DISABLE_FULL_FSYNC, similar to what is done with ROCKSDB_DISABLE_SYNC_IN_RANGE and ROCKSDB_DISABLE_FALLOCATE. Is there a reason not to implement such a solution?

If I have time, I will attempt to figure out how many calls (and from where) to fsync we are making for these tests, and how long each type of call is taking. At the moment, this change is making testing on the Mac very painful and is killing performance. While that might be necessary for data integrity, it should be something that the user opts into, not that is always on.

riversand963 · 2022-01-21T17:12:54Z

The CircleCI tests are also failing sporadically on the Mac with "deadline exceeded"

This has been happening even before this commit. At present, there has not been history long enough to demonstrate this commit is a problem for CI. We will keep an eye on this.

I was going to suggest a compile time variable such as ROCKSDB_DISABLE_FULL_FSYNC

So far, there has been no evidence of this commit hurting the overall performance of applications that use rocksdb library. I would be surprised if so because we keep fsync outside dbmutex and often in background, and RocksDB doesn't fsync very often in real workloads. Some of the unit tests become slower because they sync way more frequent. If this is a problem, it should be resolved in unit test instead adding a flag leading to different library binaries. I am fine with either DBTestBase(..., env_do_fsync=false) or using an environment variable similar to KEEP_DB, but not a compile time flag.

mrambacher · 2022-01-21T19:16:14Z

The CircleCI tests are also failing sporadically on the Mac with "deadline exceeded"

This has been happening even before this commit. At present, there has not been history long enough to demonstrate this commit is a problem for CI. We will keep an eye on this.

I was going to suggest a compile time variable such as ROCKSDB_DISABLE_FULL_FSYNC

So far, there has been no evidence of this commit hurting the overall performance of applications that use rocksdb library. I would be surprised if so because we keep fsync outside dbmutex and often in background, and RocksDB doesn't fsync very often in real workloads. Some of the unit tests become slower because they sync way more frequent. If this is a problem, it should be resolved in unit test instead adding a flag leading to different library binaries. I am fine with either DBTestBase(..., env_do_fsync=false) or using an environment variable similar to KEEP_DB, but not a compile time flag.

The TableTest/ParameterizedHarnessTest.RandomizedHarnessTest suite takes 15x what it used to on my Mac. It does not call sync/fsync at all in the test code and uses the default Env. One test on my machine that now takes 14000+ ms performed 305 syncs + 305 fsyncs. There are no calls to sync in the test itself -- it is writing data and doing compaction.

I agree that this test probably does not need to do fsync but it would need to be changed to use a "NoSyncFileSystem". Additionally, we would need to track down all of the other tests that are much slower because of this and change them as well.

And if anyone is in production on a Mac, I would guess they will not be pleasantly surprised at how long these operations now take.

riversand963 · 2022-01-21T19:50:48Z

One test on my machine that now takes 14000+ ms performed 305 syncs + 305 fsyncs

This PR does not change how many Sync() or FSync() are called. It just replaces the problematic fsync() with fcntl() for each FSync() and Sync() on OS X. If there are too many such calls, then this is a problem of the unit test. IMO, we should fix unit tests for libraries instead of fixing libraries for unit tests.

TableTest/ParameterizedHarnessTest.RandomizedHarnessTest

This is a parameterized tests. For each test case, it will create new db at least once involving syncing dir and file sync. Check GenerateArgList and see how many different combinations are generated.

FWIW, there are several things to consider

Run the tests on OS-X's equivalent of /dev/shm
Add an environment variable, ENV_DO_FSYNC, which can be picked up by the unit tests when creating the Env obj. Note that this may cause some tests to fail. This approach is similar to what we do with KEEP_DB env var. Also investigate what tests actually need the sync/fsync.
If there is already a PR, then CIs can always be used.

riversand963 · 2022-01-21T19:54:41Z

You can specify TEST_TMPDIR when running unit tests.

mrambacher · 2022-01-22T12:49:11Z

Not all tests (table_test for example) go through SpecialEnv so using that "flag" will not work.

Furthermore, table_test does not do anything special -- build a table, compaction, gets and seeks. It does not ever explicitly call sync. Each of the table_test ParameterizedHarnessTest that use a DB used to take less than 1 second and now take well over 4+ seconds.

Some investigation shows that the test with args "reverse=0, restart=16 compression=kBZip2Compression threads=4, format=2 mmap=0" is doing:
Num files opened: 976
Num files deleted: 543
Num files renamed: 304
Num Flush(): 5430
Num Sync(): 305
Num Dsync(): 244
Num Fsync(): 61
Num Close(): 549
Num Read(): 366
Num Append(): 5003
Num bytes read: 402234
Num bytes written: 553120

Each of these "sync" operations appears to be taking around an additional 23 ms (I have disabled them all individually and it does not appear any one class of operation is taking longer than another).

Perhaps the 23 ms per operation is a reasonable "tax" to pay. But the cumulative total for this test is over 14+ seconds.

I will code up a FileSystem that can be taught to skip these various operations that we can use for testing. But perhaps someone should investigate why we are doing so many of them and if it is really necessary? These are fairly simple tests of a small database. What will happen in a "real world" scenario?

riversand963 · 2022-01-22T20:06:56Z

Can you try the following? It can easily be wrapped in a script to support running make -j check.
Details can be found at https://gist.github.com/rxin/5085564.

I actually tried the following, the test execution time went down to 729ms for OS X.

$ hdiutil attach -nomount ram://524288
/dev/disk2
$ diskutil erasevolume HFS+ "ramdisk" /dev/disk2
$ TEST_TMPDIR=/Volumes/ramdisk ./db_iterator_test --gtest_filter='DBIteratorTestInstance/DBIteratorTest.IterWithSnapshot/0'

riversand963 · 2022-01-22T20:16:10Z

I will code up a FileSystem that can be taught to skip these various operations that we can use for testing

With the approach of using a Ramdisk on Mac, this adds little value while introducing unnecessary code.

someone should investigate why we are doing so many of them and if it is really necessary

This test simple, but it still requires NewDB(), each of which calls SyncManifest(). See https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl_open.cc#L332. The test code in this TEST_P or TEST_F are just the surface, we need to look deep into what actually happens.

These are fairly simple tests of a small database. What will happen in a "real world" scenario?

I doubt whether there is real world scenario repeatedly calling NewDB() 1000+ times within 1 min on a single OS X machine. So far, all your comments are based on unit tests.

facebook-github-bot added the CLA Signed label Jan 4, 2022

riversand963 linked an issue Jan 4, 2022 that may be closed by this pull request

fsync on macos should use fcntl (fd, F_FULLFSYNC) #5954

Closed

jay-zhuang approved these changes Jan 18, 2022

View reviewed changes

riversand963 force-pushed the osx-fsync branch from ff6465f to 3819e15 Compare January 18, 2022 23:44

riversand963 added 3 commits January 18, 2022 17:37

Detect presence of F_FULLFSYNC

fe84af2

Add fcntl(F_FULLFSYNC)

73e1975

riversand963 force-pushed the osx-fsync branch from 3819e15 to 69bcbf9 Compare January 19, 2022 01:38

Update HISTORY

19ea243

riversand963 force-pushed the osx-fsync branch from 69bcbf9 to 19ea243 Compare January 19, 2022 01:39

facebook-github-bot closed this in 1a8e9f0 Jan 19, 2022

riversand963 deleted the osx-fsync branch January 19, 2022 04:26

alanpaxton mentioned this pull request Dec 20, 2022

RocksDBStore.openDB performance degradation after version 6.27.3 on Mac #11035

Open

mrambacher mentioned this pull request Dec 28, 2022

Add F_BARRIERFSYNC for Sync operations on MacOS speedb-io/speedb#319

Merged

alanpaxton mentioned this pull request Jan 7, 2025

Avoid slow CI tests on MACOS_X using build flag #13276

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use fcntl(F_FULLFSYNC) on OS X #9356

Use fcntl(F_FULLFSYNC) on OS X #9356

riversand963 commented Jan 4, 2022 •

edited

Loading

facebook-github-bot commented Jan 5, 2022

jay-zhuang left a comment

riversand963 commented Jan 18, 2022

riversand963 commented Jan 18, 2022

facebook-github-bot commented Jan 18, 2022

facebook-github-bot commented Jan 18, 2022

facebook-github-bot commented Jan 19, 2022

facebook-github-bot commented Jan 19, 2022

facebook-github-bot commented Jan 19, 2022

mrambacher commented Jan 19, 2022

ajkr commented Jan 19, 2022

riversand963 commented Jan 19, 2022 •

edited

Loading

mrambacher commented Jan 21, 2022

riversand963 commented Jan 21, 2022

mrambacher commented Jan 21, 2022

riversand963 commented Jan 21, 2022

mrambacher commented Jan 21, 2022

riversand963 commented Jan 21, 2022

riversand963 commented Jan 21, 2022 •

edited

Loading

mrambacher commented Jan 22, 2022

riversand963 commented Jan 22, 2022

riversand963 commented Jan 22, 2022

Use fcntl(F_FULLFSYNC) on OS X #9356

Use fcntl(F_FULLFSYNC) on OS X #9356

Conversation

riversand963 commented Jan 4, 2022 • edited Loading

facebook-github-bot commented Jan 5, 2022

jay-zhuang left a comment

Choose a reason for hiding this comment

riversand963 commented Jan 18, 2022

riversand963 commented Jan 18, 2022

facebook-github-bot commented Jan 18, 2022

facebook-github-bot commented Jan 18, 2022

facebook-github-bot commented Jan 19, 2022

facebook-github-bot commented Jan 19, 2022

facebook-github-bot commented Jan 19, 2022

mrambacher commented Jan 19, 2022

ajkr commented Jan 19, 2022

riversand963 commented Jan 19, 2022 • edited Loading

mrambacher commented Jan 21, 2022

riversand963 commented Jan 21, 2022

mrambacher commented Jan 21, 2022

riversand963 commented Jan 21, 2022

mrambacher commented Jan 21, 2022

riversand963 commented Jan 21, 2022

riversand963 commented Jan 21, 2022 • edited Loading

mrambacher commented Jan 22, 2022

riversand963 commented Jan 22, 2022

riversand963 commented Jan 22, 2022

riversand963 commented Jan 4, 2022 •

edited

Loading

riversand963 commented Jan 19, 2022 •

edited

Loading

riversand963 commented Jan 21, 2022 •

edited

Loading