Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery trim schema with selected fields #32514

Merged
merged 4 commits into from
Sep 27, 2024

Conversation

RustedBones
Copy link
Contributor

@RustedBones RustedBones commented Sep 20, 2024

Extracted from #32360

Trim BQ schema using selected fields instead of avro schema.

Trim BQ schema directly instead of converting to avro schema and back
if (selectedFields != null
&& selectedFields.isAccessible()
&& selectedFields.get() != null) {
return selectedFields.get().contains(field.getName());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selected nested field were always filtered out the beam schema

Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, left some comments. Generally looks good, but would like to know the motivation for this change

Comment on lines +184 to +185
if (selectedFieldsProvider != null && selectedFieldsProvider.isAccessible()) {
tableSchema = BigQueryUtils.trimSchema(tableSchema, selectedFieldsProvider.get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain a little (maybe include in the PR description too) why it's better to trim based on selectedFields instead of the read session's schema? We pass selected fields as a component when building the read session, so I would think it returns a trimmed schema already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the operation here is correct. It is just for consistency/simplicity:

  • We also trim the schema in the BigQueryIO where we do not have access to a trimmed avro schema.
  • when we direct read with ARROW, to get access to the avro schema we do arrow -> beam -> avro schema conversions just for the sake of filtering known fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for explaining. Just wanted to make sure I wasn't missing something bigger.

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ahmedabu98 ahmedabu98 merged commit 271ea43 into apache:master Sep 27, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants