coord: Prioritize read latency over write latency #13967
Conversation
FYI for anyone reading this. At one point this was passing every test but a single flaky test: https://buildkite.com/materialize/tests/builds/38774. Although I've been force pushing, you can get to the code from the build if you want to see the difference. So hopefully this isn't too far from getting back to passing every test. EDIT: This is now once again passing every test but a single flaky one.
The goal of this commit is to make sure that the global timestamp never advances past the upper of any table. Doing this ensures that a table read never has to block and wait for an upper to advance, which should improve the performance of many table reads. This performance improvement comes at the cost of performance for table writes and DDL. This is accomplished by combining table advancements into group commit. All writes advance every other table to the same timestamp as the write. Periodically, empty group commits are triggered that include no writes but advance every table. The global timestamp is only advanced after these table advancements, to the timestamp of the advancement.
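The scheme described above can be sketched as a tiny model. To be clear, this is not the actual Materialize code: the `Tables` struct and the `group_commit`/`empty_group_commit`/`invariant_holds` names are invented for illustration, and timestamps are plain `u64`s.

```rust
use std::collections::BTreeMap;

#[derive(Default)]
struct Tables {
    /// Upper (exclusive write frontier) per table id.
    uppers: BTreeMap<u64, u64>,
    /// Global timestamp that reads use; must never pass any upper.
    global_ts: u64,
}

impl Tables {
    /// A group commit at `ts`: write to `table`, advance every other
    /// table's upper to the same timestamp, then advance the global
    /// timestamp. A read at `global_ts` therefore never blocks on an upper.
    fn group_commit(&mut self, table: u64, ts: u64) {
        self.uppers.insert(table, ts + 1);
        for upper in self.uppers.values_mut() {
            *upper = (*upper).max(ts + 1);
        }
        self.global_ts = ts;
    }

    /// An empty group commit: no writes, but every table is advanced.
    fn empty_group_commit(&mut self, ts: u64) {
        for upper in self.uppers.values_mut() {
            *upper = (*upper).max(ts + 1);
        }
        self.global_ts = ts;
    }

    /// The invariant from the commit message: global_ts < every upper.
    fn invariant_holds(&self) -> bool {
        self.uppers.values().all(|&u| self.global_ts < u)
    }
}

fn main() {
    let mut t = Tables::default();
    t.uppers.insert(1, 0);
    t.uppers.insert(2, 0);
    t.group_commit(1, 5);
    assert!(t.invariant_holds());
    t.empty_group_commit(9);
    assert!(t.invariant_holds());
    println!("invariant holds: {}", t.invariant_holds());
}
```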
I read it! I'm not certain I understand all of it (various nits about whether we are in DDL, changing behavior based on held locks, etc.). The idea seems good, though, but I'd love for us to take a pass at some point to clarify the state machine that the coordinator is implementing.
@@ -754,7 +760,7 @@ impl<S: Append + 'static> Coordinator<S> {
    }
    // `tick()` on `Interval` is cancel-safe:
    // https://docs.rs/tokio/1.19.2/tokio/time/struct.Interval.html#cancel-safety
-   _ = advance_timelines_interval.tick() => Message::AdvanceTimelines,
+   _ = advance_timelines_interval.tick() => Message::GroupCommitInitiate,
If we wanted to keep things moving, we could tick this each time the group commit completes.
I wouldn't be against that. My one fear is that we'd end up clogging the Coordinator up somehow, but since append is now done asynchronously it should be fine?
if let Some(timestamp) = self.get_local_timestamp_oracle_mut().should_advance_to() {
    self.queue_local_input_advances(timestamp).await;
}
By removing this, do we risk making the table advancement a low priority event? Seems removing this takes it from "once per message received" to "lowest priority event".
Queuing a table advancement was happening on a "once per message received" rate. However, actually executing the table advancements was already the lowest priority message that the Coordinator handled.
In practice the only things that triggered this were group commit, DDL, and the advance_timelines_interval, all of which now explicitly advance tables.
src/adapter/src/coord.rs
Outdated
GroupCommitInitiate,
GroupCommitApply(
    T,
    Vec<CompletedClientTransmitter<ExecuteResponse>>,
    Option<OwnedMutexGuard<()>>,
),
Nit, but it would be helpful to have comments on these variants. I'm sort of guessing what Apply means, for example, maybe "it finished"? But this enum in particular calls out for doccomments explaining what each message means.
Good idea. Btw, I was having a hard time coming up with a good name for this variant and settled on "Apply"; if you have a better idea, let me know.
The idea is that "Apply" makes a group commit visible by doing the following:
- Advancing the timeline.
- Sending responses to clients.
- Updating read holds (this is purely for convenience; we could do it elsewhere).
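For what it's worth, a doccomment sketch along those lines might look like the following. This is a hypothetical simplification of the real message type: the `ClientResponse` placeholder stands in for `CompletedClientTransmitter<ExecuteResponse>`, and the `describe` helper is invented for illustration.

```rust
/// Placeholder for a queued client response; the real type is
/// `CompletedClientTransmitter<ExecuteResponse>`.
struct ClientResponse;

/// Simplified coordinator messages for group commit.
enum Message<T> {
    /// Start a group commit: gather all pending writes, choose a single
    /// write timestamp, and append to the tables at that timestamp.
    GroupCommitInitiate,
    /// A group commit finished appending; make it visible by advancing
    /// the timeline to the given timestamp, updating read holds, and
    /// sending the queued responses back to clients.
    GroupCommitApply(T, Vec<ClientResponse>),
}

fn describe<T>(msg: &Message<T>) -> &'static str {
    match msg {
        Message::GroupCommitInitiate => "initiate",
        Message::GroupCommitApply(_, _) => "apply",
    }
}

fn main() {
    let msg: Message<u64> = Message::GroupCommitApply(5, vec![ClientResponse]);
    println!("{}", describe(&msg));
}
```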
src/adapter/src/coord/timeline.rs
Outdated
/// and writes happen at the write timestamp. After the write has completed,
/// but before a response is sent, the read timestamp must be updated to the
/// same value as the write timestamp.
Is the condition maybe weaker, that "before responding to a write, the read timestamp must be updated to the timestamp of the write"? Or must the read timestamp be updated to self.write_ts?
What about something like: "After the write has completed, but before a response is sent, the read timestamp must be updated to a value greater than or equal to self.write_ts."
That's a bit weaker and more accurate.
if next.less_equal(&self.state.write_ts) {
    next = self.state.write_ts.step_forward();
}
I'm not sure I follow this. I think this was inherited from the Reading(ts) case above, but I think here it is not as necessary? Or if it is, the comparison should be with self.state.read_ts instead of write_ts, as we want to be sure the write occurs after any previously issued read timestamp.
Maybe this is an abstraction leak, but my thought process was the following. You can't ever call append twice with the same timestamp, so to prevent that from happening, every write timestamp is larger than the previous write timestamp returned (this should be reflected in the doccomments, which it currently is not).
/// Peek the current write timestamp.
fn peek_write_ts(&self) -> T {
I read this and thought that maybe there was a different way to go from the previous design to here. I'll have to see how peek_write_ts gets used, but if it was just where we previously used write_ts, maybe there should be write_ts and seal_write_ts, or so.
peek_write_ts is used where previously peek_ts was used. Group commit uses it to determine how long to sleep in case the write timestamp is ahead of the system clock.
src/adapter/src/coord/appends.rs
Outdated
pub(crate) enum BuiltinTableUpdateSource {
    DDL,
    Background,
}
Doccomments on here to communicate the meaning.
Added
// write was waiting, then the write will be ignored and we respond to the
// client that the write was successful. This is only possible if the write
Might this be weird for a person observing through a SINK or TAIL?
yes, that's a good point. If someone is executing a tail, there's a possibility that we respond with an OK to a client who performs an insert and the person executing the tail won't see it.
This is actually existing behavior, but probably not very good. Now that DDL is combined with writes, we should be able to remove this condition: before actually deleting a table, we'd execute all pending writes, which means that this scenario is impossible. I'd probably prefer to do that in a follow-up though. I'll open an issue.
src/adapter/src/coord/appends.rs
Outdated
    }
}
/// Applies the results of a completed group commit. The global timestamp will be advanced to |
"The global timestamp" -> "read timestamp"? I'm not actually sure what the global timestamp is tbh. This probably means something like "the read timestamp of the timeline(s) of the writes"?
I think there's still a lot of language in the comments in the Coord that assumes there's only a single timeline. The global timestamp generally refers to the timestamp of the real time timeline.
How does this sound:
The read timestamp of the timeline containing user tables will be advanced to the timestamp of the completed write, the read hold on the timeline containing user tables is advanced to the new time, and responses are sent to all waiting clients.
src/adapter/src/coord/appends.rs
Outdated
self.apply_local_write(timestamp).await;
// If we're in the middle of DDL then the catalog and COMPUTE/STORAGE may be out of sync,
// so we don't advance other timeliens and we don't update read holds.
typo: "timeliens"
Fixed
…ance-table-group-commit (Conflicts: src/adapter/src/coord.rs)
The goal of this commit is to make sure that the global timestamp never
advances past the upper of any table. Doing this will make sure that a
table read will never have to block and wait for an upper to advance,
which should improve the performance of many table reads. This
performance improvement comes at the cost of performance for table
writes and DDL.
This is accomplished by combining table advancements into group commit.
All writes advance every other table to the same timestamp as the write.
Periodically, empty group commits are triggered that include no writes
but advance every table. The global timestamp is only advanced after
these table advancements, to the timestamp of the advancement.
Motivation
This PR improves read performance.
Checklist
This PR has adequate test coverage / QA involvement has been duly considered.
This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-protobuf label.
This PR includes the following user-facing behavior changes: