openlineage: isolate metadata extraction by executing OL methods in separate, forked process #40078

Merged: 1 commit merged into main from openlineage-process-execution on Jun 14, 2024

@mobuchowski mobuchowski commented Jun 5, 2024

This PR builds on #39890 (already merged).

After this change, OpenLineage executes metadata extraction in a separate, forked process.
The technique is modeled on the interaction between LocalTaskJobRunner and StandardTaskRunner: one process, in this case the StandardTaskRunner process, watches over the OpenLineage listener process during metadata extraction.

This adds a layer of isolation between task execution and OpenLineage, giving more assurance that OpenLineage execution does not interfere with task execution in any way other than taking time.
Additionally, this allows us to add a configurable timeout for OL execute methods.

The reason for the timeout is, beyond configurability, that metadata extraction code can sometimes hang - for example, when hitting the Snowflake connection issue snowflakedb/snowflake-connector-python#1898 - and we want to give as strong a guarantee as possible that OL will not cause a task to fail.
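
For readers unfamiliar with the pattern, here is a minimal sketch of the fork-and-watch idea described above, using only the Python standard library. It is not the code in this PR: `run_isolated`, its parameters, and its return values are hypothetical names chosen for illustration, and it assumes a POSIX system where `os.fork` is available. The parent forks a child to run the extraction callable, polls for completion, and kills the child once the timeout elapses, so a hanging or crashing extraction cannot take the parent down with it.

```python
import os
import signal
import time


def run_isolated(fn, timeout: float, poll_interval: float = 0.01) -> str:
    """Run fn() in a forked child process and watch it from the parent.

    Hypothetical helper for illustration only, not the provider's API.
    Returns "ok", "failed", or "timeout".
    """
    pid = os.fork()
    if pid == 0:
        # Child: run the work, then exit immediately via os._exit so that
        # none of the parent's cleanup handlers run in the child.
        exit_code = 0
        try:
            fn()
        except BaseException:
            exit_code = 1
        finally:
            os._exit(exit_code)

    # Parent: poll without blocking until the child exits or time runs out.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        finished_pid, status = os.waitpid(pid, os.WNOHANG)
        if finished_pid == pid:
            exited_cleanly = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
            return "ok" if exited_cleanly else "failed"
        time.sleep(poll_interval)

    # Timeout: kill the child and reap it so no zombie process is left behind.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return "timeout"
```

In this sketch an exception or a hang in `fn` only affects the child, and the parent observes it as a status string; the worst the extraction can do to the caller is consume time up to `timeout`.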

@boring-cyborg boring-cyborg bot added area:providers area:Scheduler including HA (high availability) scheduler provider:google Google (including GCP) related issues provider:openlineage AIP-53 provider:snowflake Issues related to Snowflake provider labels Jun 5, 2024
@mobuchowski mobuchowski force-pushed the openlineage-process-execution branch 4 times, most recently from 2c96acb to 44ba855 Compare June 7, 2024 11:51
@mobuchowski mobuchowski force-pushed the openlineage-process-execution branch 3 times, most recently from 69865d6 to 7c13f5d Compare June 11, 2024 13:11

@potiuk potiuk left a comment

few nits about splitting the PR

potiuk commented Jun 11, 2024

Ah I missed those are two commits/PRs already :)

@mobuchowski mobuchowski force-pushed the openlineage-process-execution branch from 55e5792 to 8cbb8bc Compare June 13, 2024 12:46
Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
@mobuchowski mobuchowski force-pushed the openlineage-process-execution branch from 8cbb8bc to 187d87e Compare June 14, 2024 12:39
@potiuk potiuk merged commit 1a8d12f into main Jun 14, 2024
107 checks passed
jannisko pushed a commit to jannisko/airflow that referenced this pull request Jun 15, 2024
…ss (apache#40078)

Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>
@eladkal eladkal deleted the openlineage-process-execution branch June 29, 2024 17:32
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
…ss (apache#40078)

Signed-off-by: Maciej Obuchowski <obuchowski.maciej@gmail.com>