Merge batcher generic over containers #474
Conversation
Force-pushed from fc409d5 to ca9d2e7.
This change splits the default merge batcher implementation into an outer part that knows how to maintain chains of batches, and an inner part that knows how to maintain the individual batches within those chains. The benefit is that the outer part does not need to know about the contents of the containers it holds, because that is encapsulated in the inner trait's implementation.

Signed-off-by: Moritz Hoffmann <antiguru@gmail.com>
Force-pushed from ca9d2e7 to b07aa9e.
This generally looks good! I left some comments from our review, one of which is correctness-y, but others are nits that we can clean up as you like. One bonus ask is that perhaps we could find a name other than `Batch` to avoid clashing with pre-existing uses. We discussed `Block` or `Chunk`, neither of which is especially more insightful .. but if another name presents itself, amazing! :D
```rust
for mut buffer in merged {
    for (data, time, diff) in buffer.drain(..) {
        if upper.less_equal(&time) {
            frontier.insert(time.clone());
```
Consider `insert_ref` here to avoid a clone! :D
Signed-off-by: Moritz Hoffmann <antiguru@gmail.com>
Read through, and it all seems plausible! Hard to be 100% certain, but it seems like a great path forward.
```rust
let form_chain = |this: &mut Self, final_chain: &mut Vec<Self::Chunk>, stash: &mut _| {
    if this.pending.len() == this.pending.capacity() {
        consolidate_updates(&mut this.pending);
        if this.pending.len() > this.pending.capacity() / 2 {
```
Nit, but I think this can be `>=`. More generally, I think we are looking for `this.pending.len() >= this.chunk_capacity()`, if we ever end up not maintaining exactly twice the capacity. Idk if it's worth switching over to reveal the intent, if that is the intent (I inferred it, but it could be wrong).
That makes sense, and I had the same hunch at some point. Changed it to what you suggest, because it seems to be easier to reason about. The amount of data we compact isn't affected by this because we'll merge the chains at some point anyways.
Signed-off-by: Moritz Hoffmann <antiguru@gmail.com>
```rust
/// TODO
type Time;
/// TODO
fn accept(&mut self, batch: RefOrMut<C>, stash: &mut Vec<Self::Batch>) -> Self::Batch;
```
Return type should probably be an iterator over batches.
Merge batcher that's generic over input containers and internal chains, with specific implementations.
Ideas
At the moment, a merge batcher receives a stream of vectors. It consolidates the input vectors, and inserts them into its queue structure. When sealing, it extracts ready data and presents it record-by-record to the builder. It inserts future updates into its queue.
This introduces several opportunities to introduce containers:
- Sorting indirectly, by sorting a `Vec<usize>` of indices and copying into a new container in sorted order.