Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 'scheme' and 'schema' handling in HttpHook #35712

Closed
wants to merge 5 commits into from

Conversation

Bisk1
Copy link
Contributor

@Bisk1 Bisk1 commented Nov 17, 2023

Fixes #10913

Previously, when building a connection string in HttpHook.get_conn, conn.schema was used as URL scheme, but it should be URL path. Now conn.conn_type ('http' or 'https') is used as URL scheme and conn.schema as URL path which causes base_url in HttpHook to work as expected.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Previously, when building a connection string in HttpHook.get_conn, conn.schema was used as URL scheme, but it should be URL path. Now conn.conn_type ('http' or 'https') is used as URL scheme and conn.schema as URL path which causes base_url in HttpHook to work as expected.
@Bisk1
Copy link
Contributor Author

Bisk1 commented Nov 17, 2023

This PR should fix the issue, but I wonder what should be done about connections that were created in the past that rely on existing behaviour of 'schema' used as 'scheme' as a workaround. E.g. a connection to https://example.com was created with connection URI http://example.com/https. Should I provide a migration script to find all existing connections with type = 'http' and schema = 'https' and update them so that conn_type = 'https' and schema = null?

@Bisk1 Bisk1 marked this pull request as ready for review November 17, 2023 19:36
@Bisk1 Bisk1 marked this pull request as draft November 17, 2023 19:45
daniel.dylag added 3 commits November 17, 2023 21:05
http connection handling was copied from HttpHook to LivyHook so the error was carried over too
Copy link
Contributor

@Taragolis Taragolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Airflow URI is not a valid:

  • HTTP(S) URI
  • SQLAlchemy
  • Any of DB backend URI
  • Any other type of URI representation

Airflow URI is a representation of Airflow Connection

@Bisk1
Copy link
Contributor Author

Bisk1 commented Nov 17, 2023

@Taragolis ok, I am not trying to change Airflow URI into any of those. Please note that the scope of this PR is not any change in syntax of Airflow URI, only its semantic interpretation by HttpHook.

@@ -575,10 +575,11 @@ def _generate_base_url(self, conn: Connection) -> str:
if conn.host and "://" in conn.host:
base_url: str = conn.host
else:
# schema defaults to HTTP
schema = conn.schema if conn.schema else "http"
scheme = conn.conn_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can not set connection type value in the UI, the only possible way is a select one from drop down menu.
There is no https connection type registered into the core/community providers.

For example the Apache Livy Connection will use livy, HTTP will use http, for Airfbyte will use airbyte, dbt Cloud will use dbt_cloud and the same in others hooks which based on HttpHook

Copy link
Contributor Author

@Bisk1 Bisk1 Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In that case, do you think it would make sense to change connections as follows:

  • register https as a separate connection type which inherits from HTTP with the only difference in conn_type so that it becomes visible in UI
  • for other hooks based on HttpHook, make them use connections of type http and https instead of custom types. I understand that these integrations need custom hook implementations, but what is the point of creating a custom connection type for every hook type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it could handle connection by the different way.
I'm still confuse why it required to have https connection type, what is a benefit?

Copy link
Contributor Author

@Bisk1 Bisk1 Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could use connection in a different way, but this difference doesn't change the connection type. It so happens that my org also uses Airbyte hook - I just checked and we use http connection type (not airbyte) for AirbyteHook and it works perfectly fine. It was accidental but it actually makes sense to me - just because AirbyteHook has some custom logic on top of HTTP connection doesn't make this connection anything else than HTTP because that logic (hook) is a higher level of abstraction than connection.

I can see a couple of benefits now:

  1. The tactical one is that this change fixes a bug HTTP connection parse path as schema but use as protocol #10913 where you can't correctly set path for http connection if this connection is created from Airflow URI because the path from URI is used as protocol in the hook.
  2. More importantly - UX:
    a) easier connection creation for new users - even though Airflow URI in principle is not a valid HTTP(S) URI, users can expect if to be interpreted as such for HTTP connections - nothing in the URI syntax in docs suggests that connection to https://example.com should be defined http://example.com/https so it's completely counter-intuitive
    b) wouldn't it be easier for users to have fewer connection types to choose from in the drop-down list? In practice all they need to choose for those cases is either HTTP or HTTPS and with my proposal we could get rid of types like livy, airbyte, dbt_cloud etc. in favor of generic http and https

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked and we use http connection type (not airbyte)

The main point here "you use", don't forget about other users or some one who create Connection from the UI or use JSON connection. Some of the connections also use additional fields, which not provided by HTTP connection, in additional HTTP connection provide additional configuration thought Extra field

easier connection creation for new users

URI not the only one way to define connection, and to be honest not he easiest way to create a connection, due to requirements of decode it

image

even though Airflow URI in principle is not a valid HTTP(S) URI, users can expect if to be interpreted as such for HTTP connections

And we return to initial comment #35712 (review)

This mentioned into the documentation. and maybe better to move it into the HTTP operator, maybe better move or duplicate it into the Connection.

wouldn't it be easier for users to have fewer connection types to choose in the drop-down list? In practice all they need to choose for those cases is HTTP/HTTPS

This action is required to create additional Hook for handle this "feature".

Copy link
Contributor Author

@Bisk1 Bisk1 Nov 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the connections also use additional fields

Http-based hooks don't use additional fields, some of them hide or re-label fields but it's cosmetic - no functionality is missing without this.

URI not the only one way to define connection, and to be honest not he easiest way to create a connection

This is subjective... JSON format is newer but both have its uses. People who specify connection with env variables will want a one-liner. And URI format is still supported by Airflow so why should it remain broken? Also, JSON format has problematic field names. Suppose you want to set up connection to https://example.com/endpoint, then configuration would be like this:

"example_https": {
  "conn_type": "http",
  "host": "https://example.com/endpoint"
}

but clearly https://example.com/endpoint is not a 'host', it's a URL so you will do it correctly only if you are familiar with Airflow JSON format and its quirks. This is significant cognitive load. After my proposal you could write it:

"example_https": {
  "conn_type": "https",
  "host": "example.com"
  "schema": "endpoint"
}

(or in the URI format https://example.com/endpoint).

This action is required to create additional Hook for handle this "feature".

Can you elaborate what do you mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Connection URI is not supposed to follow URI generic syntax standard like RFC 3986, then maybe it should be called "connection string" or sth like this and not "URI".

Copy link
Contributor Author

@Bisk1 Bisk1 Nov 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, this change is bigger in scope than I thought, and I can see how it could be controversial, so I probably need a devlist vote on it. I'll leave it in draft until I can fully articulate my thoughts on it.

@Bisk1
Copy link
Contributor Author

Bisk1 commented Nov 22, 2023

I analysed possible refactoring of this to take the feedback @Taragolis into account but I think right now there are many dependencies on current syntax, like authentication which probably could use refactor first, so I will close it for now

@Bisk1 Bisk1 closed this Nov 22, 2023
@Bisk1 Bisk1 deleted the fix-http-hook-get-conn branch November 22, 2023 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTP connection parse path as schema but use as protocol
2 participants