-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Description
Apache Airflow Provider(s)
openlineage
Versions of Apache Airflow Providers
latest
Apache Airflow version
2.X
Operating System
macos
Deployment
Virtualenv installation
Deployment details
No response
What happened
When using SqlExecuteQueryOperator with Snowflake, and running a query with multiple statements in it, OpenLineage will only include first query_id in ExternalQueryRunFacet.
This is problematic, as users don't have full control on how the statements are executed (when query consists of multiple statements and split_statements=False operator throws an error snowflake.connector.errors.ProgrammingError: 000008 (0A000): 01bad84f-0000-4392-0000-3d95000110ce: Actual statement count 3 did not match the desired statement count 1.). The only solution for users to retrieve all query_ids in OL events is to set split_statements=False and make sure each task runs a single statement, which is rarely a case.
In BQ, similar problem is solved by "parent_query_job" executing each statement within a "child_query_job" with a link to the parent job, so that it's easy to access all ids later on. I couldn't find a similar mechanism in Snowflake.
What you think should happen instead
Ideally, from within a single task (SqlExecuteQueryOperator) we would emit a separate OL event for each statement run, containing parentRunFacet pointing to the Airflow task. This may however take some time to implement properly and may? (or not) need some adjustments from the consumers?
As a partial solution, we could extend ExternalQueryRunFacet with a new property that accepts multiple externalQueryIds. This requires some discussion from OL community as how it fits to the spec.
Another small note, right now we are already sending the entire sql query (with all the statements) in SQLJobFacet, regardless if they execute as separate "queries" or not. So it would probably need adjustment as well.
How to reproduce
Run a sample query like:
USE WAREHOUSE COMPUTE_WH;
CREATE OR REPLACE TABLE test.public.result AS SELECT * FROM snowflake_sample_data.tpch_sf1.customer;
You can see in Snowflake that this resulted in two queries being run, with two separate query_ids and only first one is included in OpenLineage event.
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct