Append commit instead of individual transactions to commitlog#4140
Append commit instead of individual transactions to commitlog#4140
Conversation
This moves the following responsibilities to the datastore: - maintenance of the transaction offset - deciding how many transactions are in a commit
Allowing to restore `Committed` return
|
Nominating @gefjon and @Centril because they appeared in the reviewer suggestions. |
| /// Tests that, when a partial write occurs, we can read all flushed commits | ||
| /// up until the faulty one. | ||
| #[test] | ||
| fn reopen() { |
There was a problem hiding this comment.
I'm not sure what this was supposed to test originally, so I removed it.
| /// up until the faulty one. | ||
| #[test] | ||
| fn reopen() { | ||
| fn read_log_up_to_partial_write() { |
There was a problem hiding this comment.
This is basically the test previously named reopen.
| let writer = &mut self.head; | ||
| let committed = writer.commit(transactions)?; | ||
| if writer.len() >= self.opts.max_segment_size { | ||
| self.flush().expect("failed to flush segment"); |
There was a problem hiding this comment.
This seemed a bit surprising to me at first -- but the BufWriter has no way of knowing how many bytes did make it. So if flush fails, the buffer is basically garbage.
What is the typical length of the slice? |
Well, 1 :D |
Yeah, I think this is the right call until we need something else. |
Shubham8287
left a comment
There was a problem hiding this comment.
This looks code, simplifies commitlog's writes API alot.
I wonder, if we should test replication with this branch? To surface any bug, if exist for n!=1 case.
We will benefit from fn append_tx(&self, Transaction<Self::TxData>)which requires the offset to be supplied, but doesn't allocate a |
We don't really need batched transactions at the moment, so avoid the boxed array allocation. Durability::append_tx takes a `Transaction`, though, requiring the offset to be supplied by the datastore.
|
Note that Rust does not by default run destructors when the program is terminated by a signal (any signal). This, and the default being unconfirmed reads, are the reason the commitlog before this patch would flush after every write. I added a config option to preserve this behavior. Question is if we should make it the default for standalone. (We should probably also make use of the |
Changes the commitlog (and durability) write API, such that the caller decides how many transactions are in a single commit, and has to supply the transaction offsets.
This simplifies commitlog-side buffering logic to essentially a
BufWriter(which, of course, we must not forget to flush). This will help throughput, but offers less opportunity to retry failed writes. This is probably a good thing, as disks can fail in erratic ways, and we should rather crash and re-verify the commitlog (suffix) than continue writing.To that end, this patch liberally raises panics when there is a chance that internal state could be "poisoned" by partial writes, which may be debatable.
Motivation
The main motivation is to avoid maintaining the transaction offset in two places in such a way that they could diverge. As ordering commits is the responsibility of the datastore, we make it authoritative on this matter -- the commitlog will still check that offsets are contiguous, and refuse to commit if that's not the case.
A secondary, related motivation is the following:
A "commit" is an atomic unit of storage, meaning that a torn (partial) write of a commit will render the entire commit corrupt. There hasn't been a compelling case where we would want this, and have always configured the server to write exactly one transaction per commit.
The code to handle buffering of transactions is, however, rather complex, as it tries hard to allow the caller to retry writes at commit boundaries. An unfortunate consequence of this is that we'd flush to the OS very often, leaving throughput performance on the table.
So, if there is a compelling case for batching multiple transactions in a commit, it should be the datastore's responsibility.
API and ABI breaking changes
Breaks internal APIs only.
Expected complexity level and risk
5 - Mostly for the risk
Testing
Existing tests.