-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support read & write with unsynced data in FaultInjectionTestFS #12852
Support read & write with unsynced data in FaultInjectionTestFS #12852
Conversation
Summary: Follow-up to facebook#12729 and others to fix FaultInjectionTestFS handling the case where a live WAL is being appended to and synced while also being copied for checkpoint or backup, up to a known flushed (but not necessarily synced) prefix of the file. A mutex handles concurrency; we just have to deal with interleavings of legitimate operations. For more context, see the call to FlushWAL() in DBImpl::GetLiveFilesStorageInfo(). Also, the unit test for facebook#12729 was neutered by facebook#12797, and this re-enables the functionality it is testing. Test Plan: unit test expanded/updated
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
utilities/fault_injection_fs.cc
Outdated
s = target()->Read(n, options, result, scratch, dbg); | ||
if (!s.ok()) { | ||
return s; | ||
} | ||
|
||
assert(result->size() == 0 || target_read_pos_ == read_pos_); | ||
target_read_pos_ += result->size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the file is synced here? It seems we may not read all unsynced data since the buffer will be cleared before we try to read from it. Should we lock mutex_ before target()->Read() above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this! I've almost completely revamped the approach to make it resilient to this race without killing performance or making the implementation or testing a nightmare.
…ad_and_write_unsync
@pdillinger has updated the pull request. You must reimport the pull request before landing. |
@pdillinger has updated the pull request. You must reimport the pull request before landing. |
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@pdillinger has updated the pull request. You must reimport the pull request before landing. |
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@pdillinger has updated the pull request. You must reimport the pull request before landing. |
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have some minor comment and question. The rest LGTM, thanks for fixing this tricky problem!
void MoveToScratchIfNeeded(Slice* result, char* scratch) { | ||
if (result->data() != scratch) { | ||
// NOTE: might overlap | ||
std::copy_n(result->data(), result->size(), scratch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it okay to use copy_n if they might overlap? Documentation says it can lead to unpredictable ordering of the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I must have skimmed over fast enough to see its behavior wasn't "undefined" like memcpy, but apparently the ordering of the copying is undefined. I'll switch to std::copy.
Actually, I'll address both of those things in an immediate follow-up PR |
@pdillinger merged this pull request in 72438a6. |
Summary: In follow-up to facebook#12852: * Use std::copy in place of copy_n for potentially overlapping buffer * Get rid of troublesome -1 idiom from `pos_at_last_append_` and `pos_at_last_sync_` * Small improvements to FaultInjectionFSTest.ReadUnsyncedData Test Plan: CI, crash test, etc.
Summary: In follow-up to #12852: * Use std::copy in place of copy_n for potentially overlapping buffer * Get rid of troublesome -1 idiom from `pos_at_last_append_` and `pos_at_last_sync_` * Small improvements to test FaultInjectionFSTest.ReadUnsyncedData Pull Request resolved: #12861 Test Plan: CI, crash test, etc. Reviewed By: cbi42 Differential Revision: D59757484 Pulled By: pdillinger fbshipit-source-id: c6fbdc2e97c959983184925a855cc8b0285fa23f
Summary: Follow-up to #12729 and others to fix FaultInjectionTestFS handling the case where a live WAL is being appended to and synced while also being copied for checkpoint or backup, up to a known flushed (but not necessarily synced) prefix of the file. It was tricky to structure the code in a way that could handle a tricky race with Sync in another thread (see code comments, thanks Changyu) while maintaining good performance and test-ability.
For more context, see the call to FlushWAL() in DBImpl::GetLiveFilesStorageInfo().
Also, the unit test for #12729 was neutered by #12797, and this re-enables the functionality it is testing.
Test Plan: unit test expanded/updated. Local runs of blackbox_crash_test.
The implementation is structured so that a multi-threaded unit test is not needed to cover at least the code lines, as the race handling is folded into "catch up after returning unsynced and then a sync."