Snowflake Provider connection documentation is misleading #24572

@dwreeves

What do you see as an issue?

Relevant page: https://airflow.apache.org/docs/apache-airflow-providers-snowflake/stable/connections/snowflake.html

Behavior in the Airflow package

The SnowflakeHook object in Airflow behaves oddly compared to some other database hooks like Postgres (so extra clarity in the documentation is beneficial).

Most notably, the SnowflakeHook does not make use of either the host or the port of the Connection object it consumes. It is completely pointless to specify these two fields.

When constructing the URL in a runtime context, snowflake.sqlalchemy.URL is used for parsing. URL() allows for either account or host to be specified as kwargs. Either one of these 2 kwargs will correspond with what we'd conventionally call the host in a typical URL's anatomy. However, because SnowflakeHook never parses host, any host defined in the Connection object would never get this far into the parsing.
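To make the behavior concrete, here is a minimal sketch of what I mean. It is not the actual SnowflakeHook code: `FakeConnection` and `build_snowflake_params` are hypothetical stand-ins I made up for illustration, and the only assumption carried over from the hook is that the account comes from the extras while host and port are never read.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeConnection:
    """Hypothetical stand-in for airflow.models.Connection (illustration only)."""
    login: str = ""
    password: str = ""
    schema: str = ""
    host: str = ""
    port: Optional[int] = None
    extra: dict = field(default_factory=dict)


def build_snowflake_params(conn: FakeConnection) -> dict:
    """Mimic the behavior described above: conn.host and conn.port are never read."""
    extra = conn.extra
    return {
        "user": conn.login,
        "password": conn.password,
        "schema": conn.schema,
        "account": extra.get("account", ""),
        "database": extra.get("database", ""),
        "region": extra.get("region", ""),
        "warehouse": extra.get("warehouse", ""),
    }


# A user coming from SQLAlchemy puts the account in the host field.
misconfigured = FakeConnection(login="user", password="password", host="account")
params = build_snowflake_params(misconfigured)
print(params["account"])  # empty: the host value was never consulted
```

With the account moved into the extras instead, the same function picks it up, which is the whole point of the issue.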

Issue with the documentation

Right now the documentation does not make clear that it is completely pointless to specify the host. The documentation correctly omits the port, but says that the host is optional. It does not warn the user about this field never being consumed at all by the SnowflakeHook (source here).

This can lead to some confusion, especially because the Snowflake URI consumed by SQLAlchemy (which many people using Snowflake will be familiar with) uses either the "account" or "host" as its host. So a user coming from SQLAlchemy may think it is fine to put the account in the "host" field and skip filling in the "account" inside the extras (after all, it's "extra"), but that doesn't work.

I would argue that if it is correct to omit the port in the documentation (which it is), then host should also be excluded.

Furthermore, the documentation reinforces this confusion with the last few lines, where an environment variable example connection is defined that uses a host.

Finally, the documentation says "When specifying the connection in environment variable you should specify it using URI syntax", which is no longer true as of 2.3.0.

Solving the problem

I have 3 proposals for how the documentation should be updated to better reflect how the SnowflakeHook actually works.

  1. The Host option should not be listed as part of the "Configuring the Connection" section.

  2. The example URI should remove the host. The new example URI would look like this: snowflake://user:password@/db-schema?account=account&database=snow-db&region=us-east&warehouse=snow-warehouse. This URI with a blank host works fine; you can test this yourself:

    from airflow.models.connection import Connection
    
    c = Connection(conn_id="foo", uri="snowflake://user:password@/db-schema?account=account&database=snow-db&region=us-east&warehouse=snow-warehouse")
    print(c.host)
    print(c.extra_dejson)

  3. An example should be provided of a valid Snowflake connection defined using the JSON format. This example would not only stand on its own as an environment variable connection that is valid as of 2.3.0, but would also highlight some of the idiosyncrasies of how Airflow defines connections to Snowflake. It would also be valuable as a reference for the AWS SecretsManagerBackend when full_url_mode is set to False.
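As a rough idea of what the JSON example in proposal 3 could look like, here is a sketch that builds the same connection as the URI above and exports it as an environment variable. The values are the placeholder ones from the URI example, the `snowflake_default` connection ID is just an illustrative choice, and I am assuming the extras can be given as a nested object rather than a pre-serialized string.

```python
import json
import os

# Same placeholder values as the URI example above; note there is no "host" key.
conn = {
    "conn_type": "snowflake",
    "login": "user",
    "password": "password",
    "schema": "db-schema",
    "extra": {
        "account": "account",
        "database": "snow-db",
        "region": "us-east",
        "warehouse": "snow-warehouse",
    },
}

# Airflow looks up environment variable connections as AIRFLOW_CONN_<CONN_ID>.
os.environ["AIRFLOW_CONN_SNOWFLAKE_DEFAULT"] = json.dumps(conn)
print(os.environ["AIRFLOW_CONN_SNOWFLAKE_DEFAULT"])
```

The account sits under "extra" rather than in a host field, which is exactly the idiosyncrasy the documentation should call out.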

Anything else

I wasn't sure whether to label this issue as a provider issue or documentation issue; I saw templates for either but not both.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
