Conversation

@roeap (Collaborator) commented Mar 22, 2022

Description

This PR is part of the larger PR #523; specifically, it adds an implementation of a record batch writer. The implementation is a composite of the json implementations from this crate and kafka-delta-ingest. The main addition is logic to split a record batch into partitions according to the table partitioning.
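As a rough, framework-free sketch of that splitting step (the function name and types here are illustrative stand-ins, not the PR's actual arrow-based code), rows can be grouped by their partition value and each group then written to its own file:

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for splitting a batch by partition values:
/// the real code operates on arrow RecordBatches; here a slice of
/// partition-column values stands in for the batch.
fn split_by_partition(partition_values: &[&str]) -> BTreeMap<String, Vec<usize>> {
    let mut partitions: BTreeMap<String, Vec<usize>> = BTreeMap::new();
    for (row, value) in partition_values.iter().enumerate() {
        // collect the row indices belonging to each partition value
        partitions.entry((*value).to_string()).or_default().push(row);
    }
    partitions
}

fn main() {
    let values = ["2022-03-22", "2022-03-23", "2022-03-22"];
    let parts = split_by_partition(&values);
    assert_eq!(parts.len(), 2);
    assert_eq!(parts["2022-03-22"], vec![0, 2]);
    println!("split into {} partitions", parts.len());
}
```

Each group of row indices would then be used to slice the batch into one sub-batch per partition before handing it to the parquet writer.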

Related Issue(s)

towards: #509

Documentation

@houqp (Member) left a comment:

Thanks @roeap for splitting up your PR into smaller ones :)

Looks good to me overall. One question: we already have a json writer in the deltalake::writer module; have you thought about how to better organize these two writer implementations in a more consistent namespace? I can see users getting confused between the deltalake::writer and deltalake::write::writer modules.

arrow_schema: Arc<ArrowSchema>,
writer_properties: WriterProperties,
) -> Result<ArrowWriter<InMemoryWriteableCursor>, ParquetError> {
ArrowWriter::try_new(cursor, arrow_schema, Some(writer_properties))
Member:
Looks like overkill to create a function that just calls another function with basically the same arguments?

Collaborator (Author):

true - will get rid of it.

}

/// Writes the existing parquet bytes to storage and resets internal state to handle another file.
pub async fn flush(&mut self) -> Result<Vec<Add>, DeltaWriterError> {
Member:
The semantics of flush here differ from the one in the existing json writer; is there any particular reason you kept it different?

Collaborator (Author):
Not 100% sure which part you are referring to, but there are some things I considered :). The json writer implementation already creates a commit. Since we are hoping to eventually use this writer in a distributed context, I thought we should avoid doing that here. If you are referring to the error type, those should probably be harmonized. In general, it may be worth exploring a shared writer trait that different writers could implement?

@houqp (Member) Mar 28, 2022:
Yes, I was referring to the flush method from the json writer:

pub async fn flush(&mut self) -> Result<(), DeltaTableError> {

Since we are hoping to eventually use this writer in a distributed context, I thought we should avoid doing that here.

I think this makes sense. If so, then it's better to use a different name than flush to avoid confusion.

In general, it may be worth exploring a shared writer trait that different writers could implement?

I haven't put much thought into this yet; maybe think about what common interfaces we want to share between writers? From a quick glance, the writer method might not be a good fit because it takes a different type of input depending on the input format that needs to be supported.

UPDATE: I think the flush and flush_and_commit proposed by you could be a common interface we want to enforce to all format specific writers.
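A minimal sketch of what such a shared interface might look like; the trait name, the Add and error stand-ins, and the synchronous signatures are all simplifications for illustration here (the real methods are async and return delta-rs types):

```rust
/// Hypothetical stand-ins for the crate's Add action and error type.
#[derive(Debug, Clone, PartialEq)]
pub struct Add {
    pub path: String,
}

#[derive(Debug)]
pub struct WriterError;

/// Sketch of a shared writer trait: `flush` only writes data files and
/// returns the actions, `flush_and_commit` additionally commits them.
pub trait DeltaWriter {
    /// Write buffered data as parquet files and return the Add actions,
    /// leaving the commit to the caller (distributed-friendly).
    fn flush(&mut self) -> Result<Vec<Add>, WriterError>;

    /// Commit the given actions to the table log; stubbed in this sketch.
    fn commit(&mut self, actions: Vec<Add>) -> Result<usize, WriterError>;

    /// Default convenience path shared by all format-specific writers.
    fn flush_and_commit(&mut self) -> Result<usize, WriterError> {
        let actions = self.flush()?;
        self.commit(actions)
    }
}

/// Toy implementation to show the trait in use.
pub struct MockWriter {
    buffered: Vec<String>,
    committed: usize,
}

impl DeltaWriter for MockWriter {
    fn flush(&mut self) -> Result<Vec<Add>, WriterError> {
        // turn each buffered file path into an Add action
        Ok(self.buffered.drain(..).map(|path| Add { path }).collect())
    }

    fn commit(&mut self, actions: Vec<Add>) -> Result<usize, WriterError> {
        self.committed += actions.len();
        Ok(self.committed)
    }
}

fn main() {
    let mut writer = MockWriter {
        buffered: vec!["part-0000.parquet".to_string()],
        committed: 0,
    };
    assert_eq!(writer.flush_and_commit().unwrap(), 1);
    println!("committed {} file(s)", writer.committed);
}
```

The point of the split is that a distributed caller can call flush on many workers, collect the returned actions, and commit them in a single transaction, while a single-process caller just uses flush_and_commit.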

@houqp houqp requested review from mosyp and xianwill March 25, 2022 04:09
@roeap (Collaborator, Author) commented Mar 25, 2022

@houqp - I did spend some time thinking about how to consolidate these namespaces, since having two namespaces with essentially the same name seems like a bad idea. Beyond the namespace, there is also a bunch of logic shared between the two writer implementations. To consolidate this, we could move the json writer and record batch writer into the same namespace and expose the same API via a trait / generic. My guess is that the most prominent change would be to use the partitioning logic from the RB writer in the json case as well. I was a bit hesitant to do this directly, being unsure about the performance implications. For larger data sizes, the sorting operations used in the RB writer could be fairly expensive, while the partitioning in the json writer seems quite clean and straightforward to me.

Maybe one way to have both functionalities is to add an additional method flush_and_commit to both writers?

In any case, I am more than happy to try and consolidate these within this PR based on your suggestions.

@houqp (Member) commented Mar 28, 2022

Not only the namespace, but there is also a bunch of logic shared between the two writer implementations. In order to consolidate this we could maybe move the json writer and record batch writer into the same namespace and expose the same API via trait / generic.

I agree this would be the ideal approach 👍

My guess would be, that the most prominent change would be to use the partitioning logic from the RB writer also in the json case. I was a bit hesitant to do this directly for being unsure about the performance implications.

I recommend we leave the duplication and optimization work as follow ups so we can focus on coming up with the right abstraction and interface in this PR. Sometimes a little bit of code duplication is not a bad thing :)

Maybe one way to have both functionalities is to add an additional method flush_and_commit to both writers?

That's a good idea to keep both writers consistent too. And we can leave the flush method to only write the parquet files and return the actions. Curious whether @mosyp and @xianwill have any opinion on this.

@roeap (Collaborator, Author) commented Mar 28, 2022

@houqp - When trying to consolidate the writers a bit, I stumbled across another, I believe significant, difference in how the writers are designed. The json writer uses the add_file method on the DeltaTransaction to write files. The json writer from kafka-delta-ingest as well as the RecordBatchWriter make use of the storage backend directly. The RecordBatchWriter is generally modelled more closely on the json implementation from the kafka crate. To make the writer APIs more harmonized, I think having the writers do the writing themselves is the only feasible alternative, since in the envisioned usage within datafusion, having the transaction available in all writers seems quite cumbersome, if not prohibitive.

As such, I am unsure whether we want to keep the add_file method on the transaction implementation - to me this might get a bit confusing. Additionally, the add_file method does not seem to compute file statistics; I pulled that logic from the kafka crate as well... Just wanted to make sure before going ahead and removing it.

@houqp (Member) commented Mar 29, 2022

@roeap I think what you proposed makes total sense 👍 The add_file abstraction was probably added when we didn't have a good writer abstraction. It's a good idea to fully decouple data file IO from table log metadata IO.

@roeap (Collaborator, Author) commented Apr 1, 2022

@houqp - After looking into harmonizing the implementations, it turned out that the required adjustments ran much deeper than hoped. So I went for an easier path and migrated the Json implementation from kafka-delta-ingest. Since the record batch implementation was also modelled against that implementation, they were already somewhat similar. Still some cleaning up to do, but I hope this is going the right way.

houqp previously approved these changes Apr 2, 2022
@houqp (Member) left a comment:

LGTM, thanks @roeap !

@houqp (Member) commented Apr 2, 2022

leaving it open for the weekend in case others want to chime in :)

@wjones127 (Collaborator) left a comment:

This looks like great progress towards a rust-based writer!

I noticed a few TODOs that look like something we might want to address before merging. I think it might also be worth adding some documentation to the writer as is.

use std::convert::TryFrom;
use std::sync::Arc;

impl TryFrom<Arc<ArrowSchema>> for Schema {
Collaborator:

Does this implementation not work for the same use case?

impl TryFrom<&ArrowSchema> for schema::Schema {

Or, at the very least, should this impl be moved there?

Collaborator (Author):

Makes sense. Moved to the arrow file, and simplified it using the existing impl.
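The delegation pattern discussed here can be sketched with simplified placeholder types (ArrowSchemaLike and SchemaLike are not the real crate types, just stand-ins): the Arc-based impl forwards to the reference-based one instead of duplicating conversion logic.

```rust
use std::convert::TryFrom;
use std::sync::Arc;

/// Placeholder types standing in for ArrowSchema and the delta Schema.
#[derive(Debug, Clone, PartialEq)]
struct ArrowSchemaLike {
    fields: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct SchemaLike {
    fields: Vec<String>,
}

// The reference-based conversion holds the actual logic.
impl TryFrom<&ArrowSchemaLike> for SchemaLike {
    type Error = String;
    fn try_from(schema: &ArrowSchemaLike) -> Result<Self, Self::Error> {
        Ok(SchemaLike { fields: schema.fields.clone() })
    }
}

// The Arc-based conversion simply delegates to the impl above.
impl TryFrom<Arc<ArrowSchemaLike>> for SchemaLike {
    type Error = String;
    fn try_from(schema: Arc<ArrowSchemaLike>) -> Result<Self, Self::Error> {
        SchemaLike::try_from(schema.as_ref())
    }
}

fn main() {
    let arrow = Arc::new(ArrowSchemaLike { fields: vec!["id".to_string()] });
    let schema = SchemaLike::try_from(arrow).unwrap();
    assert_eq!(schema.fields, vec!["id".to_string()]);
    println!("converted {} field(s)", schema.fields.len());
}
```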

partition_columns: Vec<String>,
values: &RecordBatch,
) -> Result<Vec<PartitionResult>, DeltaWriterError> {
// TODO remove panics within closures
Collaborator:

Do you want to finish this TODO before merging?

Collaborator (Author):

Got rid of all unwraps.

Co-authored-by: Will Jones <willjones127@gmail.com>
wjones127 previously approved these changes Apr 3, 2022
houqp previously approved these changes Apr 3, 2022
@houqp (Member) left a comment:

LGTM with one minor nitpick

@roeap roeap dismissed stale reviews from houqp and wjones127 via 170d887 April 3, 2022 06:41
@houqp houqp merged commit 8c5ed46 into delta-io:main Apr 3, 2022
@houqp (Member) commented Apr 3, 2022

Thanks @roeap ! This is one big milestone :D

@roeap roeap deleted the record-batch-writer branch April 4, 2022 05:04