Skip to content

Conversation

@rtyler
Copy link
Member

@rtyler rtyler commented May 18, 2021

Description

This pull request is a work in progress, pending more complete riffing on the deltalake_ext.rs.

The objective here is that the primitive API will just handle writing some bytes to storage and the subsequent add transaction, whereas the high-level writer API will manage buffering of rows for a parquet file.

@rtyler rtyler force-pushed the mixing-in-deltalake_ext branch from 180774c to 5e4edde Compare May 21, 2021 04:38
@rtyler rtyler marked this pull request as ready for review May 21, 2021 04:39
@rtyler rtyler added binding/rust Issues for the Rust crate enhancement New feature or request labels May 21, 2021
Copy link
Member

@houqp houqp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good 👍

Comment on lines 32 to 35
/**
* Attempt to construct the BufferedJSONWriter, will fail if the table's metadata is not
* present
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, worth considering standardizing all doc string comments to /// and //!, should save you two top and bottom lines as well. this seems to be the norm in other rust projects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is super tedious and whatever "other rust projects do" doesn't convince me much unless there is a lint to enforce it 😄

I'll change it anyways, but I'm registering my protest 😈

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha, I would agree with the tedious claim if you didn't put a * at the beginning of every line ;)

@rtyler rtyler force-pushed the mixing-in-deltalake_ext branch from a1d41a5 to c43ac47 Compare May 22, 2021 02:27
* zeros and the trailing c000 string
*
*/
path_parts.push(format!("part-00000-{}-c000.snappy.parquet", Uuid::new_v4()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typical hadoop parquet has a UUID, that is shared between all files in the batch. e.g. this useful in debugging.

instead of hardcoded 00000, i'd ask to do an incrementing counter.

and instead of hard-coding snappy, i'd advice to take it from the compression engine used in ParquetBuffer .. WriterProperties.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nfx to summarize, this is the scheme we should be using for the partition name right?

part-{COUNTER}-{BATCH_UUID}-c000.ext.

Do you know what is the c000 segment for?

and instead of hard-coding snappy, i'd advice to take it from the compression engine used in ParquetBuffer .. WriterProperties.

I also agree with this one 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@houqp c000 might be the bucket number, but I don’t remember from top of my head. Will look once open a laptop next week.

i was writing custom parquet writer 4 years ago and researched it a bit then. It’s referenced from classes referenced in InsertIntoHadoopFsRelation in Apache Spark, for sure.. or at least not too far deep from it.

Main part: file naming won’t mean much for reading and writing (unless bucketed). It’s just for troubleshooting when something goes terribly wrong and you have to manually recover from fs. Probably won’t happen with delta 😜

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The counter stuff I'll file a ticket for later

@rtyler rtyler force-pushed the mixing-in-deltalake_ext branch 4 times, most recently from 93cab12 to 63d45cf Compare May 23, 2021 04:58
rtyler added 8 commits May 23, 2021 07:58
…he "add" action

This API might receive some changes to accommodate users which need to provide
`txn` actions along with the add_file
…actions

This should support use-cases where the caller needs to put a txn or two into
the log
…level writer interface

I envision this writers module encapsulating a few different flavors of
high-level writers. The exact syntax/API surface that should be common for all
the high-level writers should be largely identical but I'm not ready to
stabilize around what's here for that.

I also am on the fence on how this should handle (or not handle) `txn` actions
…, and finish up BufferedJSONWriter

There are some optimizations that come to mind for this work, but at this point
I think it's ready for simple high-level JSON writer usage
This commit also addresses a number of other code review comments
@rtyler rtyler force-pushed the mixing-in-deltalake_ext branch from 63d45cf to b6f3bcb Compare May 23, 2021 15:10
@rtyler rtyler force-pushed the mixing-in-deltalake_ext branch from b6f3bcb to e1d8d10 Compare May 23, 2021 15:14
@rtyler rtyler merged commit dc403e6 into delta-io:main May 23, 2021
@rtyler rtyler deleted the mixing-in-deltalake_ext branch May 23, 2021 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

binding/rust Issues for the Rust crate enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants