feat(ingestion): add snowflake ingestion config options #12841

Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
```diff
@@ -550,10 +550,11 @@ def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
                 len(discovered_tables) == 0
                 and len(discovered_views) == 0
                 and len(discovered_streams) == 0
+                and not self.config.allow_empty_schemas
```
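For context, a minimal sketch of the logic this hunk modifies; the surrounding `if` body and the reporter call are assumptions, not the exact DataHub source:

```python
# Sketch: skip the "no datasets" failure when the user has opted in to
# empty schemas. Everything except the discovered_* lists and the new
# config flag is an illustrative assumption.
if (
    len(discovered_tables) == 0
    and len(discovered_views) == 0
    and len(discovered_streams) == 0
    and not self.config.allow_empty_schemas
):
    self.report.report_failure(
        "permission-error",
        "No tables/views found. Please check permissions.",
    )
    return
```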
I'm a bit confused by this change: it's called `allow_empty_schemas`, but it gates the failure report we produce when no tables/views/streams are found across all databases.

We also produce a warning around empty schemas. I'd be ok with an option like `warn_on_empty_schemas` that is enabled by default and can be explicitly disabled.
The intent of this change was to avoid the error we were seeing when ingesting empty schemas:

```
ERROR {datahub.ingestion.source.snowflake.snowflake_utils:254} - permission-error => No tables/views found. Please check permissions.
```

The error implied a permissions problem when in fact we had permissions; the schema was simply empty in the source Snowflake instance.

I could change this to a `warn_on_empty_schemas` option that warns but does not throw an error on empty schemas, and is enabled by default.
Went with `warn_no_datasets` to make it more generic.
```diff
@@ -732,6 +733,8 @@ def get_snowsight_url_builder(self) -> Optional[SnowsightUrlBuilder]:
         return None

+    def is_standard_edition(self) -> bool:
+        if self.config.skip_standard_edition_check:
+            return False
```
The PR description said that the `show tags()` call was costly. I believe we call it exactly once per ingestion run - is even a single invocation super costly?

Also, is there a better/cheaper way to automatically determine whether we're on the standard or enterprise edition? I prefer to make things automated instead of requiring a config for it.
We have implemented our ingestion with one schema per recipe, and we have 9000+ schemas per full ingestion from one of our Snowflake platform instances. With that many schemas in our nightly process, the `show tags()` call would add up to a considerable amount of Snowflake compute credits over time.

The standard edition check is a valid way to determine the edition, but for edge cases like this, where the compute cost over time is considerable, I think this config option is viable.
> We have implemented our ingestion with one schema per recipe, and we have 9000+ schemas per full ingestion from one of our Snowflake platform instances

To be clear - having 9000+ ingestion recipes is not the recommended way to set up ingestion. I would recommend setting up fewer ingestions, each of which ingests metadata from multiple schemas.
Right, we are working on reducing the number of parallel ingestions. The purpose of running it with one schema per pod is observability and resilience: it prevents the entire pipeline from stopping due to an error with a single schema.
Once comments are addressed, let's get this in! Thanks team

@hsheth2,
```diff
@@ -325,6 +325,16 @@ class SnowflakeV2Config(
         " Map of share name -> details of share.",
     )

+    skip_standard_edition_check: bool = Field(
```
imo we should have something more like this; if unset, we will infer the edition using the `show tags` command:

```diff
-    skip_standard_edition_check: bool = Field(
+    known_snowflake_edition: Optional[SnowflakeEdition] = Field(
```
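A minimal sketch of that suggested pattern, assuming a `SnowflakeEdition` enum with STANDARD/ENTERPRISE values and a placeholder probe helper; none of this is the exact DataHub source:

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field


class SnowflakeEdition(str, Enum):
    STANDARD = "STANDARD"
    ENTERPRISE = "ENTERPRISE"


def infer_edition_via_show_tags() -> bool:
    # Placeholder for the real probe: run `show tags` and treat a
    # permissions error as evidence of the standard edition.
    raise NotImplementedError


class SnowflakeConfigSketch(BaseModel):
    # Simplified stand-in for SnowflakeV2Config.
    known_snowflake_edition: Optional[SnowflakeEdition] = Field(
        default=None,
        description="Snowflake edition, if known. If unset, the edition "
        "is inferred automatically via the `show tags` command.",
    )


def is_standard_edition(config: SnowflakeConfigSketch) -> bool:
    # Trust a user-supplied edition and skip the probe entirely;
    # otherwise fall back to the (potentially costly) `show tags` check.
    if config.known_snowflake_edition is not None:
        return config.known_snowflake_edition == SnowflakeEdition.STANDARD
    return infer_edition_via_show_tags()
```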
it also needs some tests to go along with it
```python
    warn_no_datasets: bool = Field(
        default=True,
        description="If True, warns when no datasets are found during ingestion. If False, ingestion fails when no datasets are found.",
    )
```
This flag is not particularly well aligned with our vision for how ingestion should be set up.

I'm willing to accept it if

- it is marked as `hidden_from_docs=True`. As such, it's also not subject to the same backwards compatibility guarantees as most other ingestion configs
- we add a comment explaining the context around why it was added in the first place
- it comes with the understanding that if at some point, it becomes difficult for us to maintain this flag, we may remove it
Added `hidden_from_docs=True` and a comment for context.
I understand that this may be deprecated in the future; if it is removed, then our version may stop working when the changes are pulled in.

Since it is possible to create empty containers in Data Catalog when a schema does not contain tables, we are wondering why we don't want to allow ingesting empty schemas as empty containers.
In most cases, empty schemas indicate a permissions issue. We're optimizing for a good user experience for that typical user.
```python
    sf_cursor.execute.side_effect = query_permission_response_override(
        default_query_results,
        [SnowflakeQuery.current_region()],
        [{"CURRENT_REGION()": "AWS_AP_SOUTH_1"}],
    )
```
why is it required for us to override the region in this test?
```python
@freeze_time(FROZEN_TIME)
def test_snowflake_is_standard_explicit_false(pytestconfig, snowflake_pipeline_config):
```
These tests look largely like they're copy-pasted from each other. In general, that's not a good sign. You might want to look into pytest test parameterization.
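For instance, a hedged sketch of how the near-duplicate tests could collapse into a single parametrized test; the helper below is a simplified stand-in for the method under review, not the real source:

```python
import pytest


def is_standard_edition(skip_check: bool, probe_says_standard: bool) -> bool:
    """Simplified stand-in for the is_standard_edition method under review."""
    if skip_check:
        return False
    return probe_says_standard


@pytest.mark.parametrize(
    "skip_check,probe_says_standard,expected",
    [
        (False, True, True),    # default path: probe reports standard edition
        (False, False, False),  # default path: probe reports enterprise edition
        (True, True, False),    # explicit skip wins regardless of the probe
    ],
)
def test_is_standard_edition(skip_check, probe_says_standard, expected):
    assert is_standard_edition(skip_check, probe_says_standard) is expected
```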
The following config options are added with this PR:

- `allow_empty_schemas` - added to avoid a generic permissions error that was being thrown when attempting to ingest schemas with no tables/views. If set to True, schemas with no tables or views are processed without reporting the generic permissions error. Default is False.
- `skip_standard_edition_check` - added to avoid repeated, costly calls to `show_tags()`, which were causing unnecessary "Cloud" compute credits to be spent in Snowflake. If set to True, assumes this is DataHub Enterprise Edition and skips the check for standard edition. Default is False.

These config options have been tested thoroughly in the Stage environment of the Optum DataHub fork.
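For reference, a hypothetical recipe sketch showing how these options would be set, assuming the flags land directly under the snowflake source config; the connection values are placeholders:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Hypothetical programmatic recipe; equivalent YAML would nest the same
# keys under source.config.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "snowflake",
            "config": {
                "account_id": "my_account",
                "username": "my_user",
                "password": "my_password",
                "warn_no_datasets": True,  # warn instead of failing when nothing is found
                "skip_standard_edition_check": True,  # skip the `show tags` edition probe
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```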