[Bug]: ignoreUnknownValues not working when using CreateDisposition.CREATE_IF_NEEDED #27892
Comments
@Abacn @reuvenlax what are your thoughts on this change in options for the new API?
Trying to understand the actual issue. Is the problem that the table is precreated, but the schema passed into BigQueryIO contains fields not in the actual table? ignoreUnknownValues is meant for the case when the "records" contain unknown values - the assumption was that the value passed into withSchema matched the actual table schema.
That's what this reads like to me, yes. So I guess it would be a different option, if we wanted to add it. Specifically, it could allow someone's cronjob to keep functioning, while dropping data, if they update the schema of the table it targets?
So this is not an expected use of the sink. It wasn't really an expected use of STREAMING_INSERTS either; however, since STREAMING_INSERTS does absolutely no schema validation, it just happened to work.
I agree with you on this, but it seems like the way things are currently set up, the system is double-checking the schema before actually writing data. So, if we have a field in the BQ table and the JSON we're sending has data for that field, but the schema provided in the request doesn't include that field, it stores the value as null. I'm curious: is this the intended behavior? Is there a reason behind it? Coming to schema (
The Storage Write API requires us to pass in a schema when we open a connection to it, and that schema must match the actual table schema. This is the schema you pass into withSchema. ignoreUnknownValues allows input elements to have fields that don't match the schema, in which case Beam will simply ignore those extra fields. This scenario worked with STREAMING_INSERTS because the old STREAMING_INSERTS API didn't require us to pass in a schema.
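To make the two behaviors discussed above concrete, here is a conceptual sketch (plain Java, not Beam's actual implementation) of what ignoreUnknownValues permits: record fields that are *not* in the writer's schema are silently dropped instead of causing a failure. The class and method names are hypothetical.

```java
import java.util.*;

public class IgnoreUnknownValuesSketch {
    // Conceptual model only: keep only the record fields that appear in
    // the schema handed to the sink, discarding the rest. This mirrors
    // what ignoreUnknownValues allows; it is not Beam's implementation.
    static Map<String, Object> filterToSchema(Set<String> schemaFields,
                                              Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            if (schemaFields.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> schema = new LinkedHashSet<>(Arrays.asList("id", "name"));
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 1);
        record.put("name", "a");
        record.put("extra", "not in schema, silently dropped");
        System.out.println(filterToSchema(schema, record));
        // prints {id=1, name=a}
    }
}
```

The point of contention in this issue is the converse direction: fields present in the *table* but absent from the writer schema, which this option does not cover.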
@reuvenlax What's the purpose of using ignoreUnknownValues, then? It doesn't seem to function as described. Maybe the documentation needs to be updated.
What happened?
We are currently running a Dataflow pipeline to transfer data from Kafka to BigQuery. Within this pipeline, we have established a predetermined schema for BigQuery, which we use for generating new tables. When the event from Kafka is updated with a new schema, our intention is to keep writing the data to the existing tables while disregarding the newly introduced fields.
In our initial implementation using STREAMING_INSERTS, we handled this scenario by setting ignoreUnknownValues to true. However, upon transitioning to STORAGE_WRITE_API to achieve exactly-once inserts, we encountered a problem: the aforementioned approach of using ignoreUnknownValues is no longer effective, resulting in potential data loss.
The error we are encountering is as follows:
Steps to reproduce:
1. Define a `TableSchema` for the target table.
2. Configure the write with:
```java
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withSchema(schema)
.withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
```
while writing to BQ with that `TableSchema`.
Adding a table to provide more information; assume ignoreUnknownValues is set in all cases.
*ANY represents the same behavior with either disposition (CREATE_NEVER / CREATE_IF_NEEDED).
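The data-loss scenario reported above can be modeled with a short sketch (plain Java, hypothetical names, not the actual Beam or Storage Write API code): the connection is opened with the schema from `withSchema()`, so any table column missing from that schema ends up null even when the incoming record carries a value for it.

```java
import java.util.*;

public class SchemaMismatchLossSketch {
    // Hypothetical model of the reported behavior: values survive only
    // for columns that the writer schema knows about; table columns
    // absent from the writer schema are written as null.
    static Map<String, Object> writeRow(Set<String> writerSchema,
                                        Set<String> tableColumns,
                                        Map<String, Object> record) {
        Map<String, Object> stored = new LinkedHashMap<>();
        for (String col : tableColumns) {
            stored.put(col, writerSchema.contains(col) ? record.get(col) : null);
        }
        return stored;
    }

    public static void main(String[] args) {
        // Table has an "email" column; the schema passed to the sink does not.
        Set<String> writerSchema = new LinkedHashSet<>(Arrays.asList("id", "name"));
        Set<String> tableColumns = new LinkedHashSet<>(Arrays.asList("id", "name", "email"));
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 1);
        record.put("name", "a");
        record.put("email", "a@example.com"); // present in record and table
        System.out.println(writeRow(writerSchema, tableColumns, record));
        // "email" is stored as null: the data loss described in this issue
    }
}
```

This is the opposite of the ignoreUnknownValues case: here the record's field is *known* to the table but unknown to the writer schema, and it is lost rather than preserved.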
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components