Adding a primitive and high-level writer interface #248
Conversation
overall looks good 👍
rust/src/writer.rs (outdated):

```rust
/**
 * Attempt to construct the BufferedJSONWriter, will fail if the table's metadata is not
 * present
 */
```
nit, worth considering standardizing all doc comments to `///` and `//!`; that saves you the top and bottom delimiter lines as well. This seems to be the norm in other Rust projects.
I think this is super tedious and whatever "other rust projects do" doesn't convince me much unless there is a lint to enforce it 😄
I'll change it anyways, but I'm registering my protest 😈
haha, I would agree with the tedious claim if you didn't put a * at the beginning of every line ;)
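For reference, the `///` and `//!` style discussed above looks like this (a minimal sketch; the item being documented is illustrative, not the actual PR code):

```rust
//! Module-level docs use `//!` at the top of the file.

/// Attempt to construct the BufferedJSONWriter; will fail if the table's
/// metadata is not present.
pub fn try_new() {
    // ...
}
```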
rust/src/delta.rs (outdated):

```rust
 * zeros and the trailing c000 string
 */
path_parts.push(format!("part-00000-{}-c000.snappy.parquet", Uuid::new_v4()));
```
Typical Hadoop parquet output has a UUID that is shared between all files in the batch; this is useful in debugging, for example.
Instead of the hardcoded 00000, I'd ask for an incrementing counter.
And instead of hard-coding snappy, I'd advise taking it from the compression engine used in the ParquetBuffer's WriterProperties.
@nfx to summarize, this is the scheme we should be using for the partition name, right?
part-{COUNTER}-{BATCH_UUID}-c000.ext
Do you know what the c000 segment is for?

> And instead of hard-coding snappy, I'd advise taking it from the compression engine used in the ParquetBuffer's WriterProperties.

I also agree with this one 👍
@houqp c000 might be the bucket number, but I don't remember off the top of my head. I'll look once I open a laptop next week.
I was writing a custom parquet writer 4 years ago and researched it a bit then. It's referenced from classes referenced in InsertIntoHadoopFsRelation in Apache Spark, for sure, or at least not too far from it.
Main part: the file naming won't mean much for reading and writing (unless bucketed). It's just for troubleshooting when something goes terribly wrong and you have to manually recover from the fs. Probably won't happen with delta 😜
The counter stuff I'll file a ticket for later
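A rough sketch of the naming scheme discussed in this thread, assuming a per-batch UUID, an incrementing counter, and a codec string taken from the writer's compression settings (the helper name and signature are hypothetical, not code from this PR):

```rust
use uuid::Uuid;

/// Hypothetical helper building a part-file name in the spirit of Hadoop's
/// `part-{counter}-{batch_uuid}-c000.{codec}.parquet` scheme. The codec string
/// would come from the ParquetBuffer's WriterProperties instead of being
/// hard-coded to "snappy".
fn part_file_name(counter: usize, batch_id: &Uuid, codec: &str) -> String {
    format!("part-{:05}-{}-c000.{}.parquet", counter, batch_id, codec)
}

// e.g. part_file_name(1, &Uuid::new_v4(), "snappy")
//   -> "part-00001-<uuid>-c000.snappy.parquet"
```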
Originally authored by @xianwill
…he "add" action This API might receive some changes to accommodate users which need to provide `txn` actions along with the add_file
…actions This should support use-cases where the caller needs to put a txn or two into the log
…level writer interface I envision this writers module encapsulating a few different flavors of high-level writers. The exact syntax/API surface that should be common for all the high-level writers should be largely identical but I'm not ready to stabilize around what's here for that. I also am on the fence on how this should handle (or not handle) `txn` actions
…, and finish up BufferedJSONWriter There are some optimizations that come to mind for this work, but at this point I think it's ready for simple high-level JSON writer usage
This commit also addresses a number of other code review comments
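As a sketch of what a common surface for those high-level writer flavors might look like (the trait and type names here are hypothetical, not the API in this PR):

```rust
use std::error::Error;

/// Placeholder for the Delta log actions a flush produces (`add`, optional `txn`, ...);
/// the real crate would use its existing action types here.
pub enum LogAction {
    Add { path: String, size: i64 },
    Txn { app_id: String, version: i64 },
}

/// Hypothetical common interface for the high-level writer flavors.
pub trait DeltaWriter {
    /// The record type a given flavor accepts (a JSON value, an Arrow batch, ...).
    type Record;

    /// Buffer a single record in memory.
    fn write(&mut self, record: Self::Record) -> Result<(), Box<dyn Error>>;

    /// Write the buffered records out as a parquet file and return the log
    /// actions the caller should include in the commit.
    fn flush(&mut self) -> Result<Vec<LogAction>, Box<dyn Error>>;
}
```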
Description
This pull request is a work in progress, pending more complete riffing on deltalake_ext.rs.
The objective is that the primitive API will just handle writing some bytes to storage and the subsequent
add transaction, whereas the high-level writer API will manage buffering of rows for a parquet file.
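A minimal sketch of how the primitive side of that split could look, assuming the caller already has a serialized parquet file in memory (the type and method names are illustrative only):

```rust
use std::error::Error;

/// Stand-in for the real DeltaTable type.
pub struct Table;

impl Table {
    /// Hypothetical primitive call: write `bytes` to a new object under the
    /// table root, then commit an `add` action (plus any caller-supplied `txn`
    /// actions) as the next table version, returning that version number.
    pub fn write_file(&mut self, bytes: &[u8]) -> Result<i64, Box<dyn Error>> {
        // 1. pick a part-file name (see the naming discussion above)
        // 2. put `bytes` into the storage backend
        // 3. build the `add` action and commit the transaction
        let _ = bytes;
        unimplemented!("sketch only")
    }
}
```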