Skip to content

Conversation

@patgarz
Copy link
Contributor

@patgarz patgarz commented Aug 25, 2023

This adds an option to spin up breeze using --integration openlineage, which will spin up a simple implementation of Marquez and send metrics to namespace airflow.

Notes before merge -

  1. Marquez in their docs require PostgreSQL 12.1, but it seems to be working with Postgres 13 and 14 as well (but not 15). I can add this to the docs, but a better option may be to spin up a separate Postgres container (though this does significantly affect resourcing). Any thoughts on appropriately handling this?
  2. I was only able to get metrics to send by setting Breeze to install 2.7.0. It did not work in the main. I'm not sure whether that indicates a bug in my configuration or a bug that has been introduced in the main.
  3. Marquez webserver conflicts with Grafana in that they both spin up on Port 3000, so the standard mapping pattern of 2<port> does not work. I opted for 23100 so that this PR did not break --integration all, but if there's another preferred pattern, it's an easy fix.

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg
Copy link

boring-cyborg bot commented Aug 25, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@patgarz patgarz force-pushed the openlineage-breeze-integration branch from 8e25fb0 to 818063b Compare August 25, 2023 19:48
@potiuk
Copy link
Member

potiuk commented Aug 25, 2023

Cool! Suggestiom though:

Should we add it to Breeze's cheathseet including ability to chnage is as env variables?

https://github.com/apache/airflow/blob/main/dev/breeze/src/airflow_breeze/utils/visuals.py#L89

@patgarz patgarz force-pushed the openlineage-breeze-integration branch from e289381 to ce74532 Compare August 26, 2023 16:14
@patgarz
Copy link
Contributor Author

patgarz commented Aug 26, 2023

Easy enough; added options to configure via <svc>_HOST_PORT & rebased.

I did not add it to the cheatsheet as the other optional metrics integrations aren't in that cheatsheet and I hesitate to add all of the integrations there as it will increase the cheetsheet length (and possibly add confusion since these aren't core Airflow services) significantly. If you disagree, let me know and I'll go ahead and add.

@potiuk
Copy link
Member

potiuk commented Aug 27, 2023

Needs conflict resoiving due to changes commands in main.

This adds an option to spin up breeze using --integration openlineage, which will spin up a simple implementation of Marquez and send metrics to namespace airflow.

Notes before merge -

  1. Marquez in their docs require PostgreSQL 12.1, but it seems to be working with Postgres 13 and 14 as well (but not 15). I can add this to the docs, but a better option may be to spin up a separate Postgres container (though this does significantly affect resourcing). Any thoughts on appropriately handling this?

We can print error and exit for 11 and 15 if you try to use openlineage integration. We already do similar checks for mysql/ARM combination. Look for: def enter_shell(**kwargs) -> RunCommandResult.

  1. I was only able to get metrics to send by setting Breeze to install 2.7.0. It did not work in the main. I'm not sure whether that indicates a bug in my configuration or a bug that has been introduced in the main.

That would be surprising. The one difference I can think of - naybe this is connected with the way openlineage provider (and any other) is used in "main" - the openlineage code is loaded directly from sources not from package. So some code might not work the same way (for example queryng package metada will not work if you run it from main (INSTALL_PROVIDERS_FROM_SOURCES="true" is set in breeze and ProvidersManager will look in sources for provider.yaml files rather than querying provider_info endpoint to get all the provider information). @mobuchowski - maybe that rings a bell?

  1. Marquez webserver conflicts with Grafana in that they both spin up on Port 3000, so the standard mapping pattern of 2<port> does not work. I opted for 23100 so that this PR did not break --integration all, but if there's another preferred pattern, it's an easy fix.

It's fine. We do it for others tool

@potiuk
Copy link
Member

potiuk commented Aug 27, 2023

BTW. Even if we do not print it in cheatsheet. It might be worthwile to log information that marquez is available at this and that addreess. This can be done in the same par of "enter_shell" pre-execution code that you can use to do Postgres verification. Would be nice to extract those checks (existing ARM/mysql and new Openlineage/postgres to separate methods).

Also I am wondering what we should do if we run openlineage but NOT postgres (say mysql). I would not really want to stop this from happening. I guess we should error out similarly as with Postgres 11, 15 and print information to use Postgres 12-14.

@mobuchowski
Copy link
Contributor

Marquez in their docs require PostgreSQL 12.1, but it seems to be working with Postgres 13 and 14 as well (but not 15). I can add this to the docs, but a better option may be to spin up a separate Postgres container (though this does significantly affect resourcing). Any thoughts on appropriately handling this?

@patgarz It's probably missing some word like "minimum" or "at least" 12.1 - can you link where you found this, to fix it?
Not working on 15 must be a bug, definitely worth fixing.

That would be surprising. The one difference I can think of - naybe this is connected with the way openlineage provider (and any other) is used in "main" - the openlineage code is loaded directly from sources not from package. So some code might not work the same way (for example queryng package metada will not work if you run it from main (INSTALL_PROVIDERS_FROM_SOURCES="true" is set in breeze and ProvidersManager will look in sources for provider.yaml files rather than querying provider_info endpoint to get all the provider information). @mobuchowski - maybe that rings a bell?

After #32692 loading plugin from provider should happen regardless if providers are installed from sources or from package - provided the plugins and providers aren't lazily loaded.

@patgarz
Copy link
Contributor Author

patgarz commented Aug 28, 2023

It's probably missing some word like "minimum" or "at least" 12.1 - can you link where you found this, to fix it?

@mobuchowski I was referencing the requirements here: https://github.com/MarquezProject/marquez#requirements

PG15 produces the following warning in Marquez startup, and the app then exits (though we should likely migrate this conversation over there):

WARN  [2023-08-28 15:55:03,189] org.flywaydb.core.internal.database.base.Database: Flyway upgrade recommended: PostgreSQL 15.4 is newer than this version of Flyway and support has not been tested. The latest supported version of PostgreSQL is 14.

@mobuchowski
Copy link
Contributor

@patgarz sure, created issue on Marquez GH: MarquezProject/marquez#2604

@patgarz patgarz force-pushed the openlineage-breeze-integration branch from ce74532 to e99c184 Compare August 28, 2023 17:16
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending regeneration of breeze images.

@patgarz patgarz force-pushed the openlineage-breeze-integration branch from e99c184 to ba46a15 Compare September 8, 2023 02:28
@potiuk
Copy link
Member

potiuk commented Sep 11, 2023

Need to solve conflicts still :(

@patgarz patgarz force-pushed the openlineage-breeze-integration branch from ba46a15 to fc2ec3d Compare September 12, 2023 22:02
@patgarz
Copy link
Contributor Author

patgarz commented Sep 12, 2023

Rebased & should be all set now.

@potiuk potiuk merged commit 249c44a into apache:main Sep 12, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Sep 12, 2023

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@potiuk
Copy link
Member

potiuk commented Sep 12, 2023

Cool :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants