Add config option to ignore insert ids #296

Closed
wants to merge 1 commit into from

@iht iht commented Sep 4, 2020

When streaming inserts into BigQuery with insert ids set (the connector's default), BigQuery deduplicates the insertions, and the applicable quotas (rows per second, bytes per second) are much lower than without deduplication.

Currently, the connector has no option to disable this deduplication. This pull request adds a configuration option to ignore insert ids and insert all rows with a null id. This disables deduplication in BigQuery (at the risk of duplicate insertions), and the applicable quotas become much higher (millions of rows per second, GBs per second).

The BigQuery documentation mentions this option for the Apache Beam connector. I am working with customers who are missing a similar configuration option in this connector.

With this pull request, you can set the option bigQueryIgnoreInsertId to true to insert without deduplication and with higher quotas.
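To illustrate the behavior described above, here is a minimal, self-contained sketch of how a flag like bigQueryIgnoreInsertId could decide whether a row carries an insert id. The class name, the helper `insertIdFor`, and the topic-partition-offset id format are assumptions made for illustration; this is not the code from the pull request.

```java
import java.util.Optional;

public class InsertIdSketch {

    // Hypothetical helper: returns the insert id to attach to a row,
    // or an empty Optional when insert ids are ignored (no deduplication).
    // The "topic-partition-offset" format is an assumption for this sketch.
    static Optional<String> insertIdFor(String topic, int partition, long offset,
                                        boolean ignoreInsertId) {
        if (ignoreInsertId) {
            // No id: BigQuery cannot deduplicate, so duplicates are possible.
            return Optional.empty();
        }
        return Optional.of(topic + "-" + partition + "-" + offset);
    }

    public static void main(String[] args) {
        // Default behavior: the row carries an id derived from its Kafka coordinates.
        System.out.println(insertIdFor("orders", 0, 42L, false)); // Optional[orders-0-42]
        // With the flag set: the row is sent without an id.
        System.out.println(insertIdFor("orders", 0, 42L, true));  // Optional.empty
    }
}
```

The trade-off is the one stated in the description: dropping the id gives access to higher streaming quotas, but a retried write may then produce duplicate rows.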

More info:

@CLAassistant
CLAassistant commented Sep 4, 2020

CLA assistant check
All committers have signed the CLA.

@codecov-commenter
codecov-commenter commented Sep 4, 2020

Codecov Report

Merging #296 into master will decrease coverage by 0.24%.
The diff coverage is 64.70%.

@@             Coverage Diff              @@
##             master     #296      +/-   ##
============================================
- Coverage     66.10%   65.86%   -0.25%     
  Complexity      267      267              
============================================
  Files            32       32              
  Lines          1484     1497      +13     
  Branches        152      154       +2     
============================================
+ Hits            981      986       +5     
- Misses          450      456       +6     
- Partials         53       55       +2     
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...wepay/kafka/connect/bigquery/BigQuerySinkTask.java | 56.63% <33.33%> (-0.59%) | 27.00 <0.00> (ø) |
| ...ka/connect/bigquery/utils/SinkRecordConverter.java | 61.90% <50.00%> (-4.77%) | 3.00 <0.00> (ø) |
| ...nect/bigquery/write/batch/GCSBatchTableWriter.java | 80.64% <66.66%> (-5.57%) | 3.00 <0.00> (ø) |
| ...afka/connect/bigquery/write/batch/TableWriter.java | 67.79% <66.66%> (-2.38%) | 6.00 <0.00> (ø) |
| ...onnect/bigquery/config/BigQuerySinkTaskConfig.java | 95.65% <100.00%> (+0.26%) | 14.00 <0.00> (ø) |

@C0urante
Collaborator

C0urante commented Sep 8, 2020

@iht I think this is addressed in #277, which has been reviewed but not merged yet.

@iht
Author

iht commented Sep 10, 2020

I should review the list of pending pull requests before attempting to contribute new changes...

Thanks for the heads up, I will keep an eye on #277 and will close this pull request.

@iht iht closed this Sep 10, 2020