-
Notifications
You must be signed in to change notification settings - Fork 14.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix 'scheme' and 'schema' handling in HttpHook #35712
Conversation
Previously, when building a connection string in HttpHook.get_conn, conn.schema was used as URL scheme, but it should be URL path. Now conn.conn_type ('http' or 'https') is used as URL scheme and conn.schema as URL path which causes base_url in HttpHook to work as expected.
This PR should fix the issue, but I wonder what should be done about connections that were created in the past that rely on existing behaviour of 'schema' used as 'scheme' as a workaround. E.g. a connection to https://example.com was created with connection URI http://example.com/https. Should I provide a migration script to find all existing connections with type = 'http' and schema = 'https' and update them so that conn_type = 'https' and schema = null? |
http connection handling was copied from HttpHook to LivyHook so the error was carried over too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Airflow URI is not a valid:
- HTTP(S) URI
- SQLAlchemy
- Any of DB backend URI
- Any other type of URI representation
Airflow URI is a representation of Airflow Connection
@Taragolis ok, I am not trying to change Airflow URI into any of those. Please note that the scope of this PR is not any change in syntax of Airflow URI, only its semantic interpretation by HttpHook. |
@@ -575,10 +575,11 @@ def _generate_base_url(self, conn: Connection) -> str: | |||
if conn.host and "://" in conn.host: | |||
base_url: str = conn.host | |||
else: | |||
# schema defaults to HTTP | |||
schema = conn.schema if conn.schema else "http" | |||
scheme = conn.conn_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can not set connection type value in the UI, the only possible way is a select one from drop down menu.
There is no https
connection type registered into the core/community providers.
For example the Apache Livy Connection will use livy
, HTTP will use http
, for Airfbyte will use airbyte
, dbt Cloud will use dbt_cloud
and the same in others hooks which based on HttpHook
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. In that case, do you think it would make sense to change connections as follows:
- register
https
as a separate connection type which inherits from HTTP with the only difference inconn_type
so that it becomes visible in UI - for other hooks based on HttpHook, make them use connections of type
http
andhttps
instead of custom types. I understand that these integrations need custom hook implementations, but what is the point of creating a custom connection type for every hook type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it could handle connection by the different way.
I'm still confuse why it required to have https
connection type, what is a benefit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could use connection in a different way, but this difference doesn't change the connection type. It so happens that my org also uses Airbyte hook - I just checked and we use http
connection type (not airbyte
) for AirbyteHook and it works perfectly fine. It was accidental but it actually makes sense to me - just because AirbyteHook has some custom logic on top of HTTP connection doesn't make this connection anything else than HTTP because that logic (hook) is a higher level of abstraction than connection.
I can see a couple of benefits now:
- The tactical one is that this change fixes a bug HTTP connection parse path as schema but use as protocol #10913 where you can't correctly set path for
http
connection if this connection is created from Airflow URI because the path from URI is used as protocol in the hook. - More importantly - UX:
a) easier connection creation for new users - even though Airflow URI in principle is not a valid HTTP(S) URI, users can expect if to be interpreted as such for HTTP connections - nothing in the URI syntax in docs suggests that connection to https://example.com should be defined http://example.com/https so it's completely counter-intuitive
b) wouldn't it be easier for users to have fewer connection types to choose from in the drop-down list? In practice all they need to choose for those cases is either HTTP or HTTPS and with my proposal we could get rid of types likelivy
,airbyte
,dbt_cloud
etc. in favor of generichttp
andhttps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just checked and we use http connection type (not airbyte)
The main point here "you use", don't forget about other users or some one who create Connection from the UI or use JSON connection. Some of the connections also use additional fields, which not provided by HTTP connection, in additional HTTP connection provide additional configuration thought Extra field
easier connection creation for new users
URI not the only one way to define connection, and to be honest not he easiest way to create a connection, due to requirements of decode it
even though Airflow URI in principle is not a valid HTTP(S) URI, users can expect if to be interpreted as such for HTTP connections
And we return to initial comment #35712 (review)
This mentioned into the documentation. and maybe better to move it into the HTTP operator, maybe better move or duplicate it into the Connection.
wouldn't it be easier for users to have fewer connection types to choose in the drop-down list? In practice all they need to choose for those cases is HTTP/HTTPS
This action is required to create additional Hook for handle this "feature".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the connections also use additional fields
Http-based hooks don't use additional fields, some of them hide or re-label fields but it's cosmetic - no functionality is missing without this.
URI not the only one way to define connection, and to be honest not he easiest way to create a connection
This is subjective... JSON format is newer but both have its uses. People who specify connection with env variables will want a one-liner. And URI format is still supported by Airflow so why should it remain broken? Also, JSON format has problematic field names. Suppose you want to set up connection to https://example.com/endpoint
, then configuration would be like this:
"example_https": {
"conn_type": "http",
"host": "https://example.com/endpoint"
}
but clearly https://example.com/endpoint
is not a 'host', it's a URL so you will do it correctly only if you are familiar with Airflow JSON format and its quirks. This is significant cognitive load. After my proposal you could write it:
"example_https": {
"conn_type": "https",
"host": "example.com"
"schema": "endpoint"
}
(or in the URI format https://example.com/endpoint).
This action is required to create additional Hook for handle this "feature".
Can you elaborate what do you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Connection URI is not supposed to follow URI generic syntax standard like RFC 3986, then maybe it should be called "connection string" or sth like this and not "URI".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, this change is bigger in scope than I thought, and I can see how it could be controversial, so I probably need a devlist vote on it. I'll leave it in draft until I can fully articulate my thoughts on it.
I analysed possible refactoring of this to take the feedback @Taragolis into account but I think right now there are many dependencies on current syntax, like authentication which probably could use refactor first, so I will close it for now |
Fixes #10913
Previously, when building a connection string in HttpHook.get_conn, conn.schema was used as URL scheme, but it should be URL path. Now conn.conn_type ('http' or 'https') is used as URL scheme and conn.schema as URL path which causes base_url in HttpHook to work as expected.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.