-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable Parallel Parquet Writer by Default, Improve Writing Test Coverage #8854
Conversation
@@ -403,7 +403,7 @@ config_namespace! { | |||
/// parquet files by serializing them in parallel. Each column | |||
/// in each row group in each output file are serialized in parallel | |||
/// leveraging a maximum possible core count of n_files*n_row_groups*n_columns. | |||
pub allow_single_file_parallelism: bool, default = true | |||
pub allow_single_file_parallelism: bool, default = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -64,6 +64,25 @@ select * from validate_parquet; | |||
1 Foo | |||
2 Bar | |||
|
|||
query ?? | |||
COPY | |||
(values (struct ('foo', (struct ('foo', make_array(1,2,3)))), make_array(timestamp '2023-01-01 01:00:01',timestamp '2023-01-01 01:00:01')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. I left a comment there
Would it be possible to change the constant values of the two rows so it is clear they are different
e.g.
(values (struct ('foo', (struct ('foo', make_array(1,2,3)))), make_array(timestamp '2023-01-01 01:00:01',timestamp '2023-01-01 01:00:01')), | |
(values (struct ('foo', (struct ('foo', make_array(10,20,30)))), make_array(timestamp '2024-01-01 01:00:01',timestamp '2024-01-01 01:00:01')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed up a change to make the rows distinct and add an array of struct as #8867 is fixed after pulling main.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @devinjdangelo -- adding tests is great and very much appreciated.
I think we can merge this PR as is, but if we can improve the tests slightly before doing so that would be great.
@@ -64,6 +64,25 @@ select * from validate_parquet; | |||
1 Foo | |||
2 Bar | |||
|
|||
query ?? | |||
COPY | |||
(values (struct ('foo', (struct ('foo', make_array(1,2,3)))), make_array(timestamp '2023-01-01 01:00:01',timestamp '2023-01-01 01:00:01')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. I left a comment there
Would it be possible to change the constant values of the two rows so it is clear they are different
e.g.
(values (struct ('foo', (struct ('foo', make_array(1,2,3)))), make_array(timestamp '2023-01-01 01:00:01',timestamp '2023-01-01 01:00:01')), | |
(values (struct ('foo', (struct ('foo', make_array(10,20,30)))), make_array(timestamp '2024-01-01 01:00:01',timestamp '2024-01-01 01:00:01')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @devinjdangelo
Which issue does this PR close?
Related to #8853 and #8851
Rationale for this change
A number of issues were recently brought up related to the parallel parquet writer. Most appear to be related to how nested types are being handled. It is not immediately obvious where the flaw is related to handling nested types.
What changes are included in this PR?
Disables parallel writing by default. Adds tests to
copy.slt
that cover the failure cases.Are these changes tested?
Yes
Are there any user-facing changes?
Potentially fewer cores will be used by default when writing parquet files.