Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 'scheme' and 'schema' handling in HttpHook #35712

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions airflow/providers/apache/livy/hooks/livy.py
Original file line number Diff line number Diff line change
Expand Up @@ -575,10 +575,11 @@ def _generate_base_url(self, conn: Connection) -> str:
if conn.host and "://" in conn.host:
base_url: str = conn.host
else:
# schema defaults to HTTP
schema = conn.schema if conn.schema else "http"
scheme = conn.conn_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can not set connection type value in the UI, the only possible way is a select one from drop down menu.
There is no https connection type registered into the core/community providers.

For example the Apache Livy Connection will use livy, HTTP will use http, for Airfbyte will use airbyte, dbt Cloud will use dbt_cloud and the same in others hooks which based on HttpHook

Copy link
Contributor Author

@Bisk1 Bisk1 Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In that case, do you think it would make sense to change connections as follows:

  • register https as a separate connection type which inherits from HTTP with the only difference in conn_type so that it becomes visible in UI
  • for other hooks based on HttpHook, make them use connections of type http and https instead of custom types. I understand that these integrations need custom hook implementations, but what is the point of creating a custom connection type for every hook type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it could handle connection by the different way.
I'm still confuse why it required to have https connection type, what is a benefit?

Copy link
Contributor Author

@Bisk1 Bisk1 Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could use connection in a different way, but this difference doesn't change the connection type. It so happens that my org also uses Airbyte hook - I just checked and we use http connection type (not airbyte) for AirbyteHook and it works perfectly fine. It was accidental but it actually makes sense to me - just because AirbyteHook has some custom logic on top of HTTP connection doesn't make this connection anything else than HTTP because that logic (hook) is a higher level of abstraction than connection.

I can see a couple of benefits now:

  1. The tactical one is that this change fixes a bug HTTP connection parse path as schema but use as protocol #10913 where you can't correctly set path for http connection if this connection is created from Airflow URI because the path from URI is used as protocol in the hook.
  2. More importantly - UX:
    a) easier connection creation for new users - even though Airflow URI in principle is not a valid HTTP(S) URI, users can expect if to be interpreted as such for HTTP connections - nothing in the URI syntax in docs suggests that connection to https://example.com should be defined http://example.com/https so it's completely counter-intuitive
    b) wouldn't it be easier for users to have fewer connection types to choose from in the drop-down list? In practice all they need to choose for those cases is either HTTP or HTTPS and with my proposal we could get rid of types like livy, airbyte, dbt_cloud etc. in favor of generic http and https

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked and we use http connection type (not airbyte)

The main point here "you use", don't forget about other users or some one who create Connection from the UI or use JSON connection. Some of the connections also use additional fields, which not provided by HTTP connection, in additional HTTP connection provide additional configuration thought Extra field

easier connection creation for new users

URI not the only one way to define connection, and to be honest not he easiest way to create a connection, due to requirements of decode it

image

even though Airflow URI in principle is not a valid HTTP(S) URI, users can expect if to be interpreted as such for HTTP connections

And we return to initial comment #35712 (review)

This mentioned into the documentation. and maybe better to move it into the HTTP operator, maybe better move or duplicate it into the Connection.

wouldn't it be easier for users to have fewer connection types to choose in the drop-down list? In practice all they need to choose for those cases is HTTP/HTTPS

This action is required to create additional Hook for handle this "feature".

Copy link
Contributor Author

@Bisk1 Bisk1 Nov 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the connections also use additional fields

Http-based hooks don't use additional fields, some of them hide or re-label fields but it's cosmetic - no functionality is missing without this.

URI not the only one way to define connection, and to be honest not he easiest way to create a connection

This is subjective... JSON format is newer but both have its uses. People who specify connection with env variables will want a one-liner. And URI format is still supported by Airflow so why should it remain broken? Also, JSON format has problematic field names. Suppose you want to set up connection to https://example.com/endpoint, then configuration would be like this:

"example_https": {
  "conn_type": "http",
  "host": "https://example.com/endpoint"
}

but clearly https://example.com/endpoint is not a 'host', it's a URL so you will do it correctly only if you are familiar with Airflow JSON format and its quirks. This is significant cognitive load. After my proposal you could write it:

"example_https": {
  "conn_type": "https",
  "host": "example.com"
  "schema": "endpoint"
}

(or in the URI format https://example.com/endpoint).

This action is required to create additional Hook for handle this "feature".

Can you elaborate what do you mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Connection URI is not supposed to follow URI generic syntax standard like RFC 3986, then maybe it should be called "connection string" or sth like this and not "URI".

Copy link
Contributor Author

@Bisk1 Bisk1 Nov 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, this change is bigger in scope than I thought, and I can see how it could be controversial, so I probably need a devlist vote on it. I'll leave it in draft until I can fully articulate my thoughts on it.

host = conn.host if conn.host else ""
base_url = f"{schema}://{host}"
base_url = f"{scheme}://{host}"
if conn.schema: # 'schema' is actually URI path (name kept for backward compatibility)
base_url += f"/{conn.schema}"
if conn.port:
base_url = f"{base_url}:{conn.port}"
return base_url
Expand Down
14 changes: 8 additions & 6 deletions airflow/providers/http/hooks/http.py
Original file line number Diff line number Diff line change
Expand Up @@ -100,10 +100,11 @@ def get_conn(self, headers: dict[Any, Any] | None = None) -> requests.Session:
if conn.host and "://" in conn.host:
self.base_url = conn.host
else:
# schema defaults to HTTP
schema = conn.schema if conn.schema else "http"
scheme = conn.conn_type
host = conn.host if conn.host else ""
self.base_url = f"{schema}://{host}"
self.base_url = f"{scheme}://{host}"
if conn.schema: # 'schema' is actually URI path (name kept for backward compatibility)
self.base_url += f"/{conn.schema}"

if conn.port:
self.base_url += f":{conn.port}"
Expand Down Expand Up @@ -323,10 +324,11 @@ async def run(
if conn.host and "://" in conn.host:
self.base_url = conn.host
else:
# schema defaults to HTTP
schema = conn.schema if conn.schema else "http"
scheme = conn.conn_type
host = conn.host if conn.host else ""
self.base_url = schema + "://" + host
self.base_url = f"{scheme}://{host}"
if conn.schema: # 'schema' is actually URI path (name kept for backward compatibility)
self.base_url += f"/{conn.schema}"

if conn.port:
self.base_url += f":{conn.port}"
Expand Down
33 changes: 12 additions & 21 deletions tests/providers/apache/livy/hooks/test_livy.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@
LIVY_CONN_ID = LivyHook.default_conn_name
DEFAULT_CONN_ID = LivyHook.default_conn_name
DEFAULT_HOST = "livy"
DEFAULT_SCHEMA = "http"
DEFAULT_SCHEME = "http"
DEFAULT_PORT = 8998
MATCH_URL = f"//{DEFAULT_HOST}:{DEFAULT_PORT}"

Expand All @@ -59,24 +59,20 @@ def setup_class(cls):
db.merge_conn(
Connection(
conn_id=DEFAULT_CONN_ID,
conn_type="http",
conn_type=DEFAULT_SCHEME,
host=DEFAULT_HOST,
schema=DEFAULT_SCHEMA,
port=DEFAULT_PORT,
)
)
db.merge_conn(Connection(conn_id="default_port", conn_type="http", host="http://host"))
db.merge_conn(Connection(conn_id="default_protocol", conn_type="http", host="host"))
db.merge_conn(Connection(conn_id="port_set", host="host", conn_type="http", port=1234))
db.merge_conn(Connection(conn_id="schema_set", host="host", conn_type="http", schema="https"))
db.merge_conn(
Connection(conn_id="dont_override_schema", conn_type="http", host="http://host", schema="https")
)
db.merge_conn(Connection(conn_id="scheme_set", host="host", conn_type="https"))
db.merge_conn(Connection(conn_id="missing_host", conn_type="http", port=1234))
db.merge_conn(Connection(conn_id="invalid_uri", uri="http://invalid_uri:4321"))
db.merge_conn(
Connection(
conn_id="with_credentials", login="login", password="secret", conn_type="http", host="host"
conn_id="with_credentials", login="login", password="secret", conn_type="https", host="host"
)
)

Expand All @@ -91,8 +87,7 @@ def teardown_class(cls):
pytest.param("default_port", "http://host", id="default-port"),
pytest.param("default_protocol", "http://host", id="default-protocol"),
pytest.param("port_set", "http://host:1234", id="with-defined-port"),
pytest.param("schema_set", "https://host", id="with-defined-schema"),
pytest.param("dont_override_schema", "http://host", id="ignore-defined-schema"),
pytest.param("scheme_set", "https://host", id="with-defined-scheme"),
],
)
def test_build_get_hook(self, conn_id, expected):
Expand Down Expand Up @@ -356,7 +351,8 @@ def test_get_batch_state_validation(self, session_id, requests_mock):
requests_mock.register_uri(
"GET", f"{MATCH_URL}/batches/{session_id}/state", json=SAMPLE_GET_RESPONSE, status_code=200
)
assert LivyHook().get_batch_state(session_id) == BatchState.SUCCESS
hook = LivyHook()
assert hook.get_batch_state(session_id) == BatchState.SUCCESS

@pytest.mark.parametrize("session_id", INVALID_SESSION_ID_TEST_CASES)
def test_get_batch_state_validation_failed(self, session_id):
Expand Down Expand Up @@ -638,16 +634,11 @@ async def mock_fun(arg1, arg2, arg3, arg4):
assert response["status"] == "error"

def set_conn(self):
db.merge_conn(
Connection(conn_id=LIVY_CONN_ID, conn_type="http", host="host", schema="http", port=8998)
)
db.merge_conn(Connection(conn_id=LIVY_CONN_ID, conn_type="http", host="host", port=8998))
db.merge_conn(Connection(conn_id="default_port", conn_type="http", host="http://host"))
db.merge_conn(Connection(conn_id="default_protocol", conn_type="http", host="host"))
db.merge_conn(Connection(conn_id="port_set", host="host", conn_type="http", port=1234))
db.merge_conn(Connection(conn_id="schema_set", host="host", conn_type="http", schema="zzz"))
db.merge_conn(
Connection(conn_id="dont_override_schema", conn_type="http", host="http://host", schema="zzz")
)
db.merge_conn(Connection(conn_id="scheme_set", host="host", conn_type="zzz"))
db.merge_conn(Connection(conn_id="missing_host", conn_type="http", port=1234))
db.merge_conn(Connection(conn_id="invalid_uri", uri="http://invalid_uri:4321"))

Expand All @@ -659,15 +650,15 @@ def test_build_get_hook(self):
"default_port": "http://host",
"default_protocol": "http://host",
"port_set": "http://host:1234",
"schema_set": "zzz://host",
"dont_override_schema": "http://host",
"scheme_set": "zzz://host",
}

for conn_id, expected in connection_url_mapping.items():
hook = LivyAsyncHook(livy_conn_id=conn_id)
response_conn: Connection = hook.get_connection(conn_id=conn_id)
assert isinstance(response_conn, Connection)
assert hook._generate_base_url(response_conn) == expected
generated = hook._generate_base_url(response_conn)
assert generated == expected

def test_build_body(self):
# minimal request
Expand Down
28 changes: 26 additions & 2 deletions tests/providers/http/hooks/test_http.py
Original file line number Diff line number Diff line change
Expand Up @@ -226,20 +226,28 @@ def run_and_return(unused_session, prepped_request, unused_extra_options, **kwar

@mock.patch("airflow.providers.http.hooks.http.HttpHook.get_connection")
def test_http_connection(self, mock_get_connection):
conn = Connection(conn_id="http_default", conn_type="http", host="localhost", schema="http")
conn = Connection(conn_id="http_default", conn_type="http", host="localhost")
mock_get_connection.return_value = conn
hook = HttpHook()
hook.get_conn({})
assert hook.base_url == "http://localhost"

@mock.patch("airflow.providers.http.hooks.http.HttpHook.get_connection")
def test_https_connection(self, mock_get_connection):
conn = Connection(conn_id="http_default", conn_type="http", host="localhost", schema="https")
conn = Connection(conn_id="http_default", conn_type="https", host="localhost")
mock_get_connection.return_value = conn
hook = HttpHook()
hook.get_conn({})
assert hook.base_url == "https://localhost"

@mock.patch("airflow.providers.http.hooks.http.HttpHook.get_connection")
def test_http_connection_with_path(self, mock_get_connection):
conn = Connection(conn_id="http_default", conn_type="http", host="localhost", schema="path")
mock_get_connection.return_value = conn
hook = HttpHook()
hook.get_conn({})
assert hook.base_url == "http://localhost/path"

@mock.patch("airflow.providers.http.hooks.http.HttpHook.get_connection")
def test_host_encoded_http_connection(self, mock_get_connection):
conn = Connection(conn_id="http_default", conn_type="http", host="http://localhost")
Expand All @@ -256,6 +264,22 @@ def test_host_encoded_https_connection(self, mock_get_connection):
hook.get_conn({})
assert hook.base_url == "https://localhost"

@mock.patch("airflow.providers.http.hooks.http.HttpHook.get_connection")
def test_host_encoded_https_connection_from_uri(self, mock_get_connection):
conn = Connection(conn_id="http_default", uri="https://localhost")
mock_get_connection.return_value = conn
hook = HttpHook()
hook.get_conn({})
assert hook.base_url == "https://localhost"

@mock.patch("airflow.providers.http.hooks.http.HttpHook.get_connection")
def test_host_encoded_https_connection_with_path_from_uri(self, mock_get_connection):
conn = Connection(conn_id="http_default", uri="https://localhost/path")
mock_get_connection.return_value = conn
hook = HttpHook()
hook.get_conn({})
assert hook.base_url == "https://localhost/path"

def test_method_converted_to_uppercase_when_created_in_lowercase(self):
assert self.get_lowercase_hook.method == "GET"

Expand Down
Loading