
Conversation

@nathadfield (Collaborator)

When loading data into a BigQuery table, there are options to create or truncate the destination table via CREATE/WRITE_DISPOSITION, but it is not currently possible to recreate the table as part of the task; doing so requires a separate task using BigQueryDeleteTableOperator.

Adding a force_delete parameter that simply calls the BigQuery hook's delete_table function would enable this.
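
For illustration, here is a minimal sketch of how the proposed parameter could be used in a DAG; the bucket, object paths, and table name are hypothetical:

```python
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Hypothetical usage: delete the destination table (if present) before the
# load, so CREATE_IF_NEEDED always recreates it with fresh metadata.
load_data = GCSToBigQueryOperator(
    task_id="gcs_to_bq",
    bucket="example-bucket",            # hypothetical bucket
    source_objects=["staging/*.csv"],   # hypothetical objects
    destination_project_dataset_table="example-project.staging.my_table",
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_TRUNCATE",
    force_delete=True,  # proposed parameter: drop the table before loading
)
```

With force_delete=True, the operator would call the BigQuery hook's delete_table before running the load job, so the destination table is always recreated.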

@boring-cyborg bot added the area:providers and provider:google labels Nov 7, 2024
@nathadfield marked this pull request as ready for review November 7, 2024 13:29
@nathadfield requested a review from potiuk November 7, 2024 13:46
@nathadfield force-pushed the feat/gcs_to_bq_force_delete branch from 6304017 to b9771ef November 7, 2024 16:08
@eladkal requested a review from shahar1 November 7, 2024 17:12
@shahar1 (Contributor) left a comment


I'm not strongly against it, but why not use the existing operator for that? (I'm questioning the atomicity of transfer operators in general.)

@nathadfield (Collaborator, Author) commented Nov 8, 2024

> I'm not strongly against it, but why not use the existing operator for that? (I'm questioning the atomicity of transfer operators in general.)

It's a fair question.

The thought process came about because of a scenario I encountered involving BigQuery dataset expiration policies, which automatically drop tables after a specified amount of time, e.g. 7 days; we use this for temporary/staging areas.

Now, suppose I use the GCSToBQOperator with CREATE_IF_NEEDED to load some data, followed by another task that performs a query against it. Initially, this will result in a table being created that expires exactly 7 days after it was created.

On that seventh day, if everything runs at the same time, the table will not have expired yet, so the GCSToBQ task will succeed but not recreate the table. However, in the few seconds between this task ending and the downstream task starting, the table will be deleted, resulting in a task failure because the table no longer exists.

The current solution is to add a prior task using BigQueryDeleteTableOperator, which is perfectly viable (sketched below) but results in lots of extra tasks. Ideally there would be another CREATE_DISPOSITION option in BigQuery - ALWAYS_RECREATE? - which would achieve the same outcome.
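
For context, a minimal sketch of that workaround, assuming hypothetical bucket and table names; ignore_if_missing stops the delete task failing when the table has already expired:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryDeleteTableOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(dag_id="staging_load", start_date=datetime(2024, 1, 1), schedule=None):
    # Drop the (possibly expired) staging table first, so the load always
    # recreates it and resets the dataset-level expiration clock.
    drop_table = BigQueryDeleteTableOperator(
        task_id="drop_staging_table",
        deletion_dataset_table="example-project.staging.my_table",  # hypothetical
        ignore_if_missing=True,  # don't fail if the table has already expired
    )

    load_table = GCSToBigQueryOperator(
        task_id="load_staging_table",
        bucket="example-bucket",           # hypothetical
        source_objects=["staging/*.csv"],  # hypothetical
        destination_project_dataset_table="example-project.staging.my_table",
        create_disposition="CREATE_IF_NEEDED",
        write_disposition="WRITE_TRUNCATE",
    )

    drop_table >> load_table
```

The proposed force_delete parameter would collapse these two tasks into one.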

…bigquery.py

Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>
@shahar1 (Contributor) commented Nov 8, 2024


Sounds fine by me; I'd be happy for additional feedback before merging.

@potiuk (Member) commented Nov 11, 2024

LGTM

@potiuk merged commit 606ef45 into apache:main Nov 11, 2024
ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…43785)

* Adding a parameter to provide the option to force delete the destination table if it already exists.

* Adding a test for force_delete

* Update providers/src/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py

Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>

---------

Co-authored-by: Shahar Epstein <60007259+shahar1@users.noreply.github.com>
@nathadfield deleted the feat/gcs_to_bq_force_delete branch November 11, 2025 14:07
