Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parquet: Add ArrowWriterOptions to skip embedding the arrow metadata #5299

Merged
merged 4 commits into from
Jan 22, 2024

Conversation

evenyag
Copy link
Contributor

@evenyag evenyag commented Jan 13, 2024

Which issue does this PR close?

Closes #5296.

Rationale for this change

What changes are included in this PR?

Adds the following APIs

  • ArrowWriter::try_new_with_options()
  • AsyncArrowWriter::try_new_with_options()
  • ArrowWriterOptions

Are there any user-facing changes?

Users can

  • use ArrowWriterOptions::with_skip_arrow_metadata() to set skip_arrow_metadata to true and skip embedding the arrow schema.
  • use ArrowWriterOptions::with_properties() to configure the parquet writer.

@evenyag evenyag changed the title parquet: Add ArrowWriterOptions to embedding the arrow metadata parquet: Add ArrowWriterOptions to skip embedding the arrow metadata Jan 13, 2024
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 13, 2024
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me thank you, I do think we should probably move WriterProperties on to ArrowWriterOptions though, as forcing users to juggle two options collections seems cumbersome. Apologies for the slow review, been a bit swamped lately

pub fn try_new_with_options(
writer: W,
arrow_schema: SchemaRef,
props: Option<WriterProperties>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could move this on to ArrowWriterOptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll try it. The cpp FileWriter API separates these two parameters so I kept try_new_with_options() similar to theirs.

::arrow::Result<std::unique_ptr<FileWriter>> Open(
    const ::arrow::Schema &schema,
    MemoryPool *pool,
    std::shared_ptr<::arrow::io::OutputStream> sink,
    std::shared_ptr<WriterProperties> properties = default_writer_properties(),
    std::shared_ptr<ArrowWriterProperties> arrow_properties = default_arrow_writer_properties()
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think we should probably move WriterProperties on to ArrowWriterOptions

This approach looks good to me. And we can also consider deprecating the old try_new method or making a breaking change to remove it, since these two are very similar

@tustvold tustvold merged commit ce58932 into apache:master Jan 22, 2024
17 checks passed
@tustvold
Copy link
Contributor

Thank you for this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parquet: Allow disabling embed ARROW_SCHEMA_META_KEY added by the ArrowWriter
3 participants