Skip to content

OpenLineage can silently lose Snowflake query_ids and can't support multiple query_ids #47488

@kacpermuda

Description

@kacpermuda

Apache Airflow Provider(s)

openlineage

Versions of Apache Airflow Providers

latest

Apache Airflow version

2.X

Operating System

macos

Deployment

Virtualenv installation

Deployment details

No response

What happened

When using SqlExecuteQueryOperator with Snowflake, and running a query with multiple statements in it, OpenLineage will only include first query_id in ExternalQueryRunFacet.

This is problematic, as users don't have full control on how the statements are executed (when query consists of multiple statements and split_statements=False operator throws an error snowflake.connector.errors.ProgrammingError: 000008 (0A000): 01bad84f-0000-4392-0000-3d95000110ce: Actual statement count 3 did not match the desired statement count 1.). The only solution for users to retrieve all query_ids in OL events is to set split_statements=False and make sure each task runs a single statement, which is rarely a case.

In BQ, similar problem is solved by "parent_query_job" executing each statement within a "child_query_job" with a link to the parent job, so that it's easy to access all ids later on. I couldn't find a similar mechanism in Snowflake.

What you think should happen instead

Ideally, from within a single task (SqlExecuteQueryOperator) we would emit a separate OL event for each statement run, containing parentRunFacet pointing to the Airflow task. This may however take some time to implement properly and may? (or not) need some adjustments from the consumers?

As a partial solution, we could extend ExternalQueryRunFacet with a new property that accepts multiple externalQueryIds. This requires some discussion from OL community as how it fits to the spec.

Another small note, right now we are already sending the entire sql query (with all the statements) in SQLJobFacet, regardless if they execute as separate "queries" or not. So it would probably need adjustment as well.

How to reproduce

Run a sample query like:

USE WAREHOUSE COMPUTE_WH;
CREATE OR REPLACE TABLE test.public.result AS SELECT * FROM snowflake_sample_data.tpch_sf1.customer;

You can see in Snowflake that this resulted in two queries being run, with two separate query_ids and only first one is included in OpenLineage event.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions