Add config option to ignore insert ids #296
Closed
When streaming rows into BigQuery with insert ids set (the connector's default), BigQuery deduplicates the insertions, and the applicable quotas (rows per second, bytes per second) are much lower than without deduplication.
Currently, there is no option in the connector to disable this deduplication. This pull request adds a configuration option to ignore insert ids and insert all rows with a null id. This disables deduplication in BigQuery (at the risk of duplicate insertions), and the applicable quotas are much higher (millions of rows per second, GBs per second).
The BigQuery documentation mentions this option for the Apache Beam connector. I am working with customers who are missing a similar configuration option in this connector.
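To illustrate the idea, here is a minimal, self-contained sketch of how a boolean flag could switch between per-row insert ids and null ids. It is not the code in this pull request: the Row class, the buildRows helper, the ignoreInsertId parameter, and the id format are all hypothetical names chosen for the example, and no BigQuery client call is made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IgnoreInsertIdSketch {

    // Hypothetical stand-in for one streaming-insert row:
    // an insert id (null means "no deduplication requested") plus the row content.
    static final class Row {
        final String insertId;
        final Map<String, Object> content;

        Row(String insertId, Map<String, Object> content) {
            this.insertId = insertId;
            this.content = content;
        }
    }

    // Builds the rows for one streaming insert. The id format (prefix + index)
    // is an assumption for illustration only.
    static List<Row> buildRows(List<Map<String, Object>> records,
                               String idPrefix,
                               boolean ignoreInsertId) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            // With the flag on, every row carries a null id, so nothing is deduplicated.
            String insertId = ignoreInsertId ? null : idPrefix + "-" + i;
            rows.add(new Row(insertId, records.get(i)));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = new ArrayList<>();
        records.add(Map.<String, Object>of("user", "a"));
        records.add(Map.<String, Object>of("user", "b"));

        for (Row row : buildRows(records, "batch-1", false)) {
            System.out.println("with ids:    " + row.insertId);
        }
        for (Row row : buildRows(records, "batch-1", true)) {
            System.out.println("ignore ids:  " + row.insertId);
        }
    }
}
```

Keeping the decision in a single place like this means the rest of the write path does not need to know whether deduplication is enabled.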
With this pull request, you can set the option bigQueryIgnoreInsertId to true to insert without deduplication and with higher quotas.
More info: