Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable Parallel Parquet Writer by Default, Improve Writing Test Coverage #8854

Merged
merged 5 commits into from
Jan 16, 2024

Conversation

devinjdangelo
Copy link
Contributor

Which issue does this PR close?

Related to #8853 and #8851

Rationale for this change

A number of issues were recently brought up related to the parallel parquet writer. Most appear to be related to how nested types are being handled. It is not immediately obvious where the flaw is related to handling nested types.

What changes are included in this PR?

Disables parallel writing by default. Adds tests to copy.slt that cover the failure cases.

Are these changes tested?

Yes

Are there any user-facing changes?

Potentially fewer cores will be used by default when writing parquet files.

@@ -403,7 +403,7 @@ config_namespace! {
/// parquet files by serializing them in parallel. Each column
/// in each row group in each output file are serialized in parallel
/// leveraging a maximum possible core count of n_files*n_row_groups*n_columns.
pub allow_single_file_parallelism: bool, default = true
pub allow_single_file_parallelism: bool, default = false
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As flagged in #8851 and #8853, setting this to true will result in the new copy.slt tests to fail.

@@ -64,6 +64,25 @@ select * from validate_parquet;
1 Foo
2 Bar

query ??
COPY
(values (struct ('foo', (struct ('foo', make_array(1,2,3)))), make_array(timestamp '2023-01-01 01:00:01',timestamp '2023-01-01 01:00:01')),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New test case covers:

  1. Struct types
  2. Array types
  3. Multiple levels of nesting (struct of struct with array value)
  4. Timestamp types

I attempted to add array of struct as specifically flagged in #8851, but was unable to construct the literal value due to #8867.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I left a comment there

Would it be possible to change the constant values of the two rows so it is clear they are different

e.g.

Suggested change
(values (struct ('foo', (struct ('foo', make_array(1,2,3)))), make_array(timestamp '2023-01-01 01:00:01',timestamp '2023-01-01 01:00:01')),
(values (struct ('foo', (struct ('foo', make_array(10,20,30)))), make_array(timestamp '2024-01-01 01:00:01',timestamp '2024-01-01 01:00:01')),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed up a change to make the rows distinct and add an array of struct as #8867 is fixed after pulling main.

@devinjdangelo devinjdangelo marked this pull request as ready for review January 15, 2024 01:03
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @devinjdangelo -- adding tests is great and very much appreciated.

I think we can merge this PR as is, but if we can improve the tests slightly before doing so that would be great.

@@ -64,6 +64,25 @@ select * from validate_parquet;
1 Foo
2 Bar

query ??
COPY
(values (struct ('foo', (struct ('foo', make_array(1,2,3)))), make_array(timestamp '2023-01-01 01:00:01',timestamp '2023-01-01 01:00:01')),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I left a comment there

Would it be possible to change the constant values of the two rows so it is clear they are different

e.g.

Suggested change
(values (struct ('foo', (struct ('foo', make_array(1,2,3)))), make_array(timestamp '2023-01-01 01:00:01',timestamp '2023-01-01 01:00:01')),
(values (struct ('foo', (struct ('foo', make_array(10,20,30)))), make_array(timestamp '2024-01-01 01:00:01',timestamp '2024-01-01 01:00:01')),

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @devinjdangelo

@alamb alamb merged commit d2ff112 into apache:main Jan 16, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants