Skip to content

Conversation

mihailotim-db
Copy link
Contributor

@mihailotim-db mihailotim-db commented Mar 18, 2025

What changes were proposed in this pull request?

AddMetadataColumns should add only unique and necessary metadata columns, not the entire child's metadata output

Why are the changes needed?

There are 3 reasons to make this change:

  1. Adding duplicates of metadata columns creates problems for single-pass analyzer, where we need to hack our way around adding these columns, because both AddMetadataColumns and ResolveReferences can add same attributes.
  2. Adding unique and only necessary metadata columns is more semantically correct
  3. This PR is also a preparation to fix SPARK-51545

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new suite to test AddMetadataColumns rule. Existing tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Mar 18, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/unique_metadata_cols branch 4 times, most recently from 70f521d to ba26f39 Compare March 18, 2025 14:11
@mihailotim-db mihailotim-db changed the title [WIP] Fix add metadata columns [SPARK-51544][SQL] Add only unique and necessary metadata columns Mar 18, 2025
@mihailotim-db mihailotim-db force-pushed the mihailotim-db/unique_metadata_cols branch 2 times, most recently from 55e5cb8 to cb0223a Compare March 18, 2025 20:55
case attr
if requiredAttrIds.contains(attr.exprId) && !existingExprIds
.contains(attr.exprId) =>
existingExprIds.add(attr.exprId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given we won't have duplicated expr IDs within metadata columns, do we really need to update this ID set here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never actually guarantee that we won't have duplicate ExprIds in metadata output, we just inherit from below and append the new ones. Maybe instead of enforcing this here, we should enforce uniqueness when creating metadata?

@mihailotim-db mihailotim-db force-pushed the mihailotim-db/unique_metadata_cols branch from cb0223a to 076cb67 Compare March 19, 2025 15:07
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in aaa94c4 Mar 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants