Skip to content

Optimize Unnest and implement skip_nulls=true if specified#7371

Merged
alamb merged 7 commits intoapache:mainfrom
smiklos:unnest-impl-skip-null
Aug 24, 2023
Merged

Optimize Unnest and implement skip_nulls=true if specified#7371
alamb merged 7 commits intoapache:mainfrom
smiklos:unnest-impl-skip-null

Conversation

@smiklos
Copy link
Contributor

@smiklos smiklos commented Aug 22, 2023

Which issue does this PR close?

related to #7087.
Closes #6961

Rationale for this change

This implements the new preserve_nulls option in unnest exec.
I've also adjusted the code to avoid copying and rely only on the take kernel as needed.

Todo

  • Set preserve_nulls to false -> If we really want align with clickhouse etc.

What changes are included in this PR?

The code will now respect the flag preserve_nulls in UnnestOptions.
This way we can also make the sql api consistent with Clickhouse/duckdb while allowing dataframe users to choose between implementations.

Are these changes tested?

Yes, added some new tests as well.

Are there any user-facing changes?

No

No

@github-actions github-actions bot added the core Core DataFusion crate label Aug 22, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nicely done -- thank you @smiklos 🏆

cc @vincev and @izveigor and @jayzhan211 who I think where also interested in this API

) -> Result<Arc<dyn Array + 'static>> {
let values = list_array.values();
if list_array.null_count() == 0 || !options.preserve_nulls {
Ok(values.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Comment on lines 438 to 443
let take_index =
<Int32Type as ArrowPrimitiveType>::Native::from_usize(
take_offset + i,
)
.unwrap();
builder.append_value(take_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be easier to understand

Suggested change
let take_index =
<Int32Type as ArrowPrimitiveType>::Native::from_usize(
take_offset + i,
)
.unwrap();
builder.append_value(take_index);
let take_index = take_offset + i;
builder.append_value(take_index as i32);

"+------+----+",
"| 1 | A |",
"| 2 | A |",
"| | B |", // this row should not be here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@alamb alamb changed the title Unnest impl skip null Optimize Unest and implement skip_nulls=true if specified Aug 23, 2023
@smiklos
Copy link
Contributor Author

smiklos commented Aug 23, 2023

@alamb Thanks for the review!

Do we want to set preserve_nulls = false anywhere in the code? I recall the idea was to keep the dataframe api use preserve_nulls = true but when using sql unnest would drop the nulls

@vincev
Copy link
Contributor

vincev commented Aug 23, 2023

Do we want to set preserve_nulls = false anywhere in the code? I recall the idea was to keep the dataframe api use preserve_nulls = true but when using sql unnest would drop the nulls

Setting preserve_null = false for the dataframe api would be a breaking change for users as data pipelines that use dataframe unnest would now produce different results.

@alamb
Copy link
Contributor

alamb commented Aug 23, 2023

Do we want to set preserve_nulls = false anywhere in the code? I recall the idea was to keep the dataframe api use preserve_nulls = true but when using sql unnest would drop the nulls

I recommend that we merge this PR in without any changes to the default behavior and then we can propose a follow on PR with just the default behavior change (for SQL) by itself for discussion. That will make it easier to follow what exactly the proposed change in behavior is.

@Dandandan Dandandan changed the title Optimize Unest and implement skip_nulls=true if specified Optimize Unnest and implement skip_nulls=true if specified Aug 23, 2023
@alamb alamb merged commit c43d7be into apache:main Aug 24, 2023
@smiklos smiklos deleted the unnest-impl-skip-null branch August 24, 2023 19:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance of unnest even more

3 participants