
BigQueryColumnCheckOperator doesn't actually implement use_legacy_sql kwarg #28343

Closed
tmccall8829 opened this issue Dec 13, 2022 · 14 comments
Labels
area:providers, good first issue, kind:bug, provider:google

Comments

@tmccall8829 commented Dec 13, 2022

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google==8.6.0

Apache Airflow version

v2.3.4

Operating System

Composer (so I think Debian, probably?)

Deployment

Composer

Deployment details

We use a standard/vanilla Composer deployment setup -- nothing fancy.

What happened

Using the following BigQueryColumnCheckOperator block:

# minimum_valid_fresh_date is a pendulum datetime obj
basic_column_quality_checks = BigQueryColumnCheckOperator(
    task_id="check_columns",
    table="my_dataset.my_table",
    use_legacy_sql=False,
    column_mapping={
        "updated_at": {"min": {"geq_to": minimum_valid_fresh_date}},
    },
)

results in the following error in Airflow:

google.api_core.exceptions.BadRequest: 400 Querying tables with INTERVAL or JSON type is not supported in Legacy SQL: 787345995969:my_dataset.my_table.

This occurs even though use_legacy_sql is set to False.

What you think should happen instead

Not sure if the problem here is that the error message is wrong/outdated, or if the use_legacy_sql flag isn't actually being passed through to BigQueryHook() when it gets called in BigQueryColumnCheckOperator._submit_job(). Either way, we should not be using legacy SQL in this case, since we explicitly specify use_legacy_sql=False.

How to reproduce

Run a simple DAG with just a single task (the operator I pasted above) on airflow v2.3.4 with apache-airflow-providers-google==8.6.0. The BQ table in question should have a column updated_at that is a DATETIME type.

Anything else

Digging into the source, I think the problem is that the hook returned in BigQueryColumnCheckOperator._submit_job() does not get passed self.use_legacy_sql, so the hook object just uses the default (which is set to True). If it's really that simple, I'm happy to submit a PR to fix it!
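
To illustrate, here's a minimal sketch of the kind of change I have in mind (hypothetical, not the actual provider code, though BigQueryHook really does take a use_legacy_sql constructor argument):

# Hypothetical sketch of the suspected fix inside
# BigQueryColumnCheckOperator._submit_job(): forward the operator's flag
# instead of relying on the hook's default (use_legacy_sql=True).
hook = BigQueryHook(
    gcp_conn_id=self.gcp_conn_id,
    use_legacy_sql=self.use_legacy_sql,  # currently omitted, so the default wins
)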

FWIW, it actually seems like this occurs in other BigQuery operators defined here as well... it could definitely be that I'm misreading the source, though, so apologies if that's the case!

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


@tmccall8829 added the area:providers and kind:bug labels Dec 13, 2022
@boring-cyborg bot commented Dec 13, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@denimalpaca (Contributor)

Not entirely sure what's going on here, but it looks like it should be fine thanks to the _BigQueryDbHookMixin class: it defines get_hook(), which passes self.use_legacy_sql into the hook. So the flag should be getting through.
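
For context, the relevant mixin looks roughly like this (a simplified sketch, not the verbatim provider source):

class _BigQueryDbHookMixin:
    def get_hook(self) -> BigQueryHook:
        # Build the hook from the operator's own settings, including the
        # use_legacy_sql flag the user passed to the operator.
        return BigQueryHook(
            gcp_conn_id=self.gcp_conn_id,
            use_legacy_sql=self.use_legacy_sql,
            location=self.location,
        )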

@tmccall8829 (Author)

Thanks @denimalpaca -- I spent some more time digging around the source and noticed that get_hook() is indeed called in the execute() method on the operator class. Appreciate the clarification!

Do you think maybe this is a composer-specific issue, then? Or something else? Because we're definitely seeing an error suggesting we have use_legacy_sql=True even though it's set to False.

I'll try to see if I can reproduce the error locally with the setup & dependencies above to rule out any composer-specific issues. Any ideas for other possible debugging routes?

@denimalpaca (Contributor)

The local reproduction would be really useful, if you provide some steps for that I can try it on my end as well.

Unfortunately I don't have any other debugging ideas. These operators are a bit obfuscated by the BigQuery API, so I'm not really sure how that use_legacy_sql param is actually being used under the hood.

@lwyszomi or @turbaszek might have more insight, they were recent (and maybe relevant) contributors according to the git blame!

@potiuk (Member) commented Jan 2, 2023

Maybe @bhirsz or @VladaZakharova might also be able to help.

@VladaZakharova (Contributor)

Hi Team :)
I will test it on my end and will answer as soon as I have some results.

@eladkal (Contributor) commented Jan 2, 2023

cc @vchiapaikeo maybe you can also take a look?

@eladkal added the provider:google and good first issue labels Jan 2, 2023
@vchiapaikeo (Contributor)

I'll let @VladaZakharova take a stab at it first. I'm going back to work tomorrow, so I might not have time for the next few days. Please let me know if I can help further though, @VladaZakharova!

@tmccall8829 (Author)

Thanks all! Testing this locally turns out to be a bit of a pain, given the integration with BigQuery and the fact that my org keeps our projects pretty locked down -- but it could also just be my own lack of experience with Airflow. Please let me know if I can help along the way. In the meantime, we're using BigQueryCheckOperator with some more explicit SQL, and that works just fine for us.

@vchiapaikeo (Contributor) commented Jan 7, 2023

Not entirely related but there seems to be a bug on main here:

        failed_tests(
            f"Column: {col}\n\tCheck: {check},\n\tCheck Values: {check_values}\n"
            for col, checks in self.column_mapping.items()
            for check, check_values in checks.items()
            if not check_values["success"]
        )

https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/bigquery.py#L614-L619

failed_tests is a list and is not callable. It seems like the call to .extend() is missing, which results in this exception:

[2023-01-07, 01:46:36 UTC] {taskinstance.py:1797} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/airflow/providers/google/cloud/operators/bigquery.py", line 616, in execute
    for col, checks in self.column_mapping.items()
TypeError: 'list' object is not callable

This was introduced here (@denimalpaca):
87eb46b#diff-529929b4ca60ce73b8da0f45d8a5c43c2d4e391b913fe78b39892899f812951eR616-R621
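
For reference, restoring the previous call shape looks like this (the generator expression is passed to list.extend instead of the list being called directly):

failed_tests.extend(
    f"Column: {col}\n\tCheck: {check},\n\tCheck Values: {check_values}\n"
    for col, checks in self.column_mapping.items()
    for check, check_values in checks.items()
    if not check_values["success"]
)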

After reverting this line to the previous working call (failed_tests.extend), I cannot reproduce this bug. I added a JSON column to my table and with use_legacy_sql=True, I get an exception as expected:

[screenshot: task log showing the expected legacy-SQL exception]

And with use_legacy_sql=False, things work as expected:

[screenshot: task log showing the column check passing]

DAG I used to test:

from airflow import DAG

from airflow.providers.google.cloud.operators.bigquery import BigQueryColumnCheckOperator

DEFAULT_TASK_ARGS = {
    "owner": "gcp-data-platform",
    "retries": 1,
    "retry_delay": 10,
    "start_date": "2022-08-01",
}

with DAG(
    max_active_runs=1,
    concurrency=2,
    catchup=False,
    schedule_interval="@daily",
    dag_id="test_bigquery_column_check",
    default_args=DEFAULT_TASK_ARGS,
) as dag:
    basic_column_quality_checks = BigQueryColumnCheckOperator(
        task_id="check_columns",
        table="my-project.vchiapaikeo.test1",
        use_legacy_sql=False,
        column_mapping={
            "col1": {"min": {"greater_than": 0}},
        },
    )

@VladaZakharova (Contributor)

Hi Team :)
Sorry for the late answer. I was trying to reproduce the error from the bug report using fields with DATETIME data types, but it seems that I have hit the same problem as @vchiapaikeo:

[2023-01-08, 18:22:54 UTC] {bigquery.py:1539} INFO - Inserting job airflow_1673202174759817_9f8c4ae8e3753678300a383ad345da39
[2023-01-08, 18:22:56 UTC] {bigquery.py:603} INFO - Record:      col_name check_type               check_result
0  updated_at        min 2023-01-04 11:19:23.202309
[2023-01-08, 18:22:56 UTC] {taskinstance.py:1904} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/operators/bigquery.py", line 616, in execute
    failed_tests(
TypeError: 'list' object is not callable

@tmccall8829 Can we also check how you added data to your dataset? Did you use an already existing dataset, or did you use some operator to upload it? Was this data transferred from another storage like GCS?

@vchiapaikeo (Contributor)

Created this PR to fix the runtime / type error: #28796

@VladaZakharova (Contributor)

@vchiapaikeo Thank you for the quick PR for this fix!

@potiuk (Member) commented Jan 9, 2023

Closed by #28796

@potiuk closed this as completed Jan 9, 2023