-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigQuery trim schema with selected fields #32514
Conversation
Trim BQ schema directly instead of converting to avro schema and back
if (selectedFields != null | ||
&& selectedFields.isAccessible() | ||
&& selectedFields.get() != null) { | ||
return selectedFields.get().contains(field.getName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
selected nested field were always filtered out the beam schema
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
87d9b71
to
d251f07
Compare
Assigning reviewers. If you would like to opt out of this review, comment R: @Abacn for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, left some comments. Generally looks good, but would like to know the motivation for this change
...ogle-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceDef.java
Outdated
Show resolved
Hide resolved
if (selectedFieldsProvider != null && selectedFieldsProvider.isAccessible()) { | ||
tableSchema = BigQueryUtils.trimSchema(tableSchema, selectedFieldsProvider.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain a little (maybe include in the PR description too) why it's better to trim based on selectedFields
instead of the read session's schema? We pass selected fields as a component when building the read session, so I would think it returns a trimmed schema already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the operation here is correct. It is just for consistency/simplicity:
- We also trim the schema in the
BigQueryIO
where we do not have access to a trimmed avro schema. - when we direct read with
ARROW
, to get access to the avro schema we doarrow -> beam -> avro
schema conversions just for the sake of filtering known fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for explaining. Just wanted to make sure I wasn't missing something bigger.
...d-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOStorageQueryTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Extracted from #32360
Trim BQ schema using selected fields instead of avro schema.