Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Core SQL Provider #24476

Merged
merged 18 commits into from
Jun 27, 2022
Merged

Add Core SQL Provider #24476

merged 18 commits into from
Jun 27, 2022

Conversation

denimalpaca
Copy link
Contributor

@denimalpaca denimalpaca commented Jun 15, 2022

Adds operators, tests, and new and updated docs for a Core SQL
Provider. This provider is made in favor of adding these operators
to the existing SQL operators in core Airflow. The new provider
will allow for quicker development cycles.

closes: #23874, #24422
related: #23915

@denimalpaca
Copy link
Contributor Author

@kaxil @potiuk @mik-laj I'm not sure why these remaining checks are failing -- one seems to be for an Airflow doc I didn't touch, and the other is a breeze output.
The breeze one is a bit nebulous: Update output of breeze commands in BREEZE.rst.........................................Failed seems like something breeze should be doing automatically.
The other doc error seems to be around imports I didn't touch:
ModuleNotFoundError: No module named 'airflow.provider_info'
and a link error:
apache-airflow-providers /opt/airflow/docs/apache-airflow-providers/packages-ref.rst::Anonymous hyperlink mismatch: 1 references but 0 targets.
but no line number or other info, and in a file I didn't touch. Also, this command passes locally.

@potiuk
Copy link
Member

potiuk commented Jun 16, 2022

Please rebase to latest main. You are 28 commits behind the main. So the changes might come from outdated code your PR is based on. This should always be the first thing you do when you see failures that you don't recognize.

@potiuk
Copy link
Member

potiuk commented Jun 16, 2022

Rbase early, rebase often is the mantra that everyone should follow.

BTW. You cna do it easily with the drop-down next to "Update banch" at the bottom here.

@denimalpaca denimalpaca force-pushed the add_sql_provider branch 3 times, most recently from 271b948 to ddd1da0 Compare June 17, 2022 13:49
@denimalpaca
Copy link
Contributor Author

@potiuk rebased a few times now, still seeing these errors.

@potiuk
Copy link
Member

potiuk commented Jun 17, 2022

Yes. you need to install pre-commit and run those pre-commits locally to fix them. Those are real problems. Also your docs have errors to be fixed.

@denimalpaca
Copy link
Contributor Author

denimalpaca commented Jun 21, 2022

Ok. I do have pre-commit installed and it definitely was running before all my commits, which is why I was so confused about the errors, because other errors did come up in pre-commit that I fixed before pushing anything.

The only document I see failing anymore is one that I did not touch, but I'll run that again locally and see. Error message wasn't helpful but the error was in my doc.

@potiuk
Copy link
Member

potiuk commented Jun 21, 2022

Always rebase as first check - you are 36 commits behind. The problem could have been fixed already

image

@potiuk
Copy link
Member

potiuk commented Jun 21, 2022

Besides - you might have to run the same as is done in CI - run pre-commit with --all-files . And if you carefully read what static checks tell you, you will see the exact precommit and command to run to re-run it

It looks like in this case the problem is with images conflicting as command line "help" because someone (likely me) added another command so you have now conflicting images in yours and main version.

You wlll need to run breeze regenerate-command images to fix them

@potiuk
Copy link
Member

potiuk commented Jun 21, 2022

BTW: @denimalpaca -> #24581 I've added better instructions in both - the file that might get conflicted if multiple people modify breeze commands at the same time as well in the error message printed by pre-commit.

It will explicittly tel you to run breeze regenerate-command-images in such cases, so that I don't have to do it in PRs :)

@potiuk
Copy link
Member

potiuk commented Jun 21, 2022

BTW @denimalpaca I just merged #24581 - so actually you might need to rebase again (and this time you should get the right instuctions when you get conflict!)

@denimalpaca denimalpaca force-pushed the add_sql_provider branch 3 times, most recently from 7382861 to ca8e87a Compare June 21, 2022 20:32
@denimalpaca
Copy link
Contributor Author

@potiuk still not sure why tests are failing... now it's a mypy error that wasn't there before and a Error: Process completed with exit code 247. which seems to be a docker resource issue. The branch is rebased with latest updates at the time of this comment.

Copy link
Contributor

@ephraimbuddy ephraimbuddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the location /providers/core is appropriate because core doesn't really give much information. Maybe we should move all the SQL operators in airflow.operator into the providers package at /providers/sql?
Though, I don't really know why the SQL operators are still in core airflow, so it's something to check

@denimalpaca
Copy link
Contributor Author

I don't think the location /providers/core is appropriate because core doesn't really give much information. Maybe we should move all the SQL operators in airflow.operator into the providers package at /providers/sql? Though, I don't really know why the SQL operators are still in core airflow, so it's something to check

There are some issues with having a providers.sql because there technically is already one, and one gets added in one of the pre-commit tests for provider packages to make sure there aren't duplicate names.

@potiuk
Copy link
Member

potiuk commented Jun 23, 2022

The idea of core.sql name came from me and I tihnk it makes sense for the following reasons:

  • this is not a provider that is usable on it's own. you would not install just "core-sql" provider - you would simply get it as dependency to other providers that use "sql"

  • this provider - unlike the providers we have so far - does not provide "specific" integration but it provides common airflow functionality that should be released independently from the "core" airflow. I believe we need a way to distinguish such "core extension" functionality from "regular" providers. We can have more similar "functionalities" we want to extract.

  • The idea behind it is that if - for example - you want to add "sql lineage" features (which I belive was the main reason for even thinking about such "sql-extraction". you could add it to "core sql" provider and release it with new "snowflake", "oracle", "mysql" providers depending on the new version of such "core sql" provider.

  • as the result- you could use better (I am guessing column-based) airflow lineage integration immediately when such "core" provider is released, without waiting for a) airflow 2.4 or 2.5 b) waiting until all SQL-related-providers bump min-airflow-version to thsoe 2.4 or 2.5 versions (or keep backwards compatibility integration for quite a while - which might be painful). If we add such "column lineage" extension today to DBAPI, this is what it would take.

  • It does not have to be "core" - maybe a better name would be good, but I believe we somehow need to distinguish such "functiona" providers from "actual integrations"

  • it is a bit of experiment - indeed - we have not done anything like that before, but I think this is the only way if we want to add such multi-provider functionality faster and make it available also for users who use a bit (not too much) earlier versions of airlfow. But I am totally open to other ideas here :).

This is reasoning behind the idea - happy to hear about any comments but just wanted to provide a context here.

Copy link
Contributor

@ephraimbuddy ephraimbuddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@potiuk, your explanation makes sense to me and I can't think of a better name.

airflow/providers/core/sql/operators/sql.py Outdated Show resolved Hide resolved
airflow/providers/core/sql/operators/sql.py Show resolved Hide resolved
denimalpaca and others added 17 commits June 27, 2022 11:09
Adds operators, tests, and new and updated docs for a Core SQL
Provider. This provider is made in favor of adding these operators
to the existing SQL operators in core Airflow. The new provider
will allow for quicker development cycles.
I am bad at git and this change got overwritten during a git pull
with the remote branch.

Add Core SQL Provider

Adds operators, tests, and new and updated docs for a Core SQL
Provider. This provider is made in favor of adding these operators
to the existing SQL operators in core Airflow. The new provider
will allow for quicker development cycles.

Fix build doc errors

Fix capitalization and grammar errors in doc
This shouldn't actually matter; because the `check_values` options are checked before this code is executed, there will always be some path to exit the function, whether it's a catch-all `return` at the end of the function as it was, or whether `equal_to` is checked last and the inner `if` doesn't run like the change suggests.

Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
Static checks require an always-accessible return statement.
The logic in _get_match() was previously short-circuiting after one
value check, however, multiple value checks per check type are
permitted. The logic is updated to ensure all value checks are
run and that any failure is propogated through the subsequent
checks. Additional tests are added to ensure proper logic.
Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, going to merge this, feel free to take care of the other comments in a follow-up PR please

@@ -0,0 +1,16 @@
# Licensed to the Apache Software Foundation (ASF) under one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of "core.sql" too like @ephraimbuddy -- maybe "base.sql" is better 🤷 but I do know that there already is a BaseSQLOperator -- which we should think of moving to a provider too tbh.

But naming is heard. tagging a few people if they have better suggestions for names: @dstandish @jedcunningham @uranusjr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "common" would be better than "core"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can still change it - before we release the provider and I think we should only do it after we move DBApiHook there (as explained in #23971).

I think common might be a better choice indeed.

if not records:
raise AirflowException(f"The following query returned zero rows: {self.sql}")

self.log.info(f"Record: {records}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.log.info(f"Record: {records}")
self.log.info("Record: %s", records)

Comment on lines +245 to +252
.. code-block:: python

{
"row_count_check": {"check_statement": "COUNT(*) = 1000"},
"column_sum_check": {"check_statement": "col_a + col_b < col_c"},
}

:param conn_id: the connection ID used to connect to the database
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be indented further, not sure though, can you post a screenshot of how this looks in docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that built if I run breeze build-docs? I can't seem to find a preview anywhere (and not sure that putting in a separate markdown file to preview is going to be an accurate reflection of this).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. when you run breeze build-docs --package-filter apache-airlfow-provider-core-sql and run ./docs/start-docs-server.sh or something - you should see the docs (it's described in contributing docs I am quite sure of that)

if not records:
raise AirflowException(f"The following query returned zero rows: {self.sql}")

self.log.info(f"Record: {records}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.log.info(f"Record: {records}")
self.log.info("Record: %s", records)

@kaxil kaxil merged commit eb6f45e into apache:main Jun 27, 2022
@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Aug 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:dev-tools area:providers changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) kind:documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Templated SQL Check Operators for Columns and Tables
5 participants