feat(ingestion): add snowflake ingestion config options #12841


Open
wants to merge 15 commits into master

Conversation

AndrewSmith593

This PR adds the following config options. allow_empty_schemas was added to avoid a generic permissions error that was thrown when attempting to ingest schemas with no tables/views. skip_standard_edition_check was added to avoid redundant and costly calls to show_tags(), which were spending unnecessary Snowflake compute credits. These config options have been tested thoroughly in the Stage environment of the Optum DataHub fork.

allow_empty_schemas - If set to True, allows schemas with no tables or views to be processed without reporting a generic permissions error. Defaults to False.

skip_standard_edition_check - If set to True, assumes the Snowflake account is Enterprise Edition (or higher) and skips the standard-edition check. Defaults to False.
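
For illustration, a minimal sketch of a recipe using these options as originally proposed, via DataHub's Python pipeline API; the account and credential values are placeholders:

# Hypothetical recipe sketch; account/credential values are placeholders,
# and the option names reflect this PR as originally proposed.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "snowflake",
            "config": {
                "account_id": "my_account",
                "username": "datahub_user",
                "password": "********",
                # New options added by this PR:
                "allow_empty_schemas": True,
                "skip_standard_edition_check": True,
            },
        },
        "sink": {"type": "console"},
    }
)
pipeline.run()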

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Mar 11, 2025

codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅


@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Mar 11, 2025
@@ -550,10 +550,11 @@ def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
len(discovered_tables) == 0
and len(discovered_views) == 0
and len(discovered_streams) == 0
and not self.config.allow_empty_schemas
Collaborator

I'm a bit confused by this change - it's called allow_empty_schemas, but it gates the failure report we produce if no tables/views/streams are found across all databases.

We also produce a warning around empty schemas. I'd be ok with an option like warn_on_empty_schemas that is enabled by default and can be explicitly disabled.

Author

The intent of this change was to avoid the error we were seeing when ingesting empty schemas: ERROR {datahub.ingestion.source.snowflake.snowflake_utils:254} - permission-error => No tables/views found. Please check permissions.

The error implied a permissions error when in fact we had permissions, but the schema showed as empty in source Snowflake.

I could change this to a warn_on_empty_schemas to warn but not throw an error on empty schemas, which is default enabled.

Author

Went with warn_no_datasets to make it more generic
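
For context, a rough sketch of how the renamed warn_no_datasets flag might gate that report; the reporter method names here are assumptions, not the exact DataHub API:

# Rough sketch only; reporter method names are assumptions.
if (
    len(discovered_tables) == 0
    and len(discovered_views) == 0
    and len(discovered_streams) == 0
):
    if self.config.warn_no_datasets:
        # Warn but let the run continue, e.g. for intentionally empty schemas.
        self.report.warning("No tables/views/streams found; schemas may be empty.")
    else:
        # Original behavior: fail, since empty results usually indicate
        # missing permissions.
        self.report.failure(
            "permission-error",
            "No tables/views found. Please check permissions.",
        )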

@@ -732,6 +733,8 @@ def get_snowsight_url_builder(self) -> Optional[SnowsightUrlBuilder]:
return None

def is_standard_edition(self) -> bool:
if self.config.skip_standard_edition_check:
return False
Collaborator

the PR description said that the show tags() call was costly. I believe we call it exactly once per ingestion run - is even a single invocation super costly?

Also, is there a better/cheaper way to automatically determine if we're on standard/enterprise edition? I prefer to make things automated instead of requiring a config for it

Author

We have implemented our ingestion with one schema per recipe, and have 9000+ schemas per full ingestion from one of our Snowflake platform instances. The show tags() call would add up to a considerable amount of Snowflake compute credits over time with this many schemas in our process, which runs nightly.

The standard edition check is a valid way to determine the edition, but for edge cases like this, where compute cost over time is considerable, I think this config option is viable.

Collaborator

We have implemented our ingestion with one schema per recipe, and have 9000+ schemas per full ingestion from one of our Snowflake platform instances

To be clear - having 9000+ ingestion recipes is not the recommended way to set up ingestion. I would recommend setting up fewer ingestions, and having those ingest metadata from multiple schemas.

Author

Right, we are working on reducing the number of parallel ingestions. The purpose of running 1 schema per pod is observability and resilience: it prevents the entire pipeline from stopping due to an error with a single schema.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 20, 2025
@jjoyce0510
Collaborator

Once comments are addressed, let's get this in! Thanks team

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 28, 2025
@rtekal
Contributor

rtekal commented Apr 15, 2025

@hsheth2,
We addressed all the review comments. Would you please review the PR again? Thanks in advance.


@@ -325,6 +325,16 @@ class SnowflakeV2Config(
" Map of share name -> details of share.",
)

skip_standard_edition_check: bool = Field(
Collaborator

imo we should have something more like this

if unset, we will infer the edition using the show tags command

Suggested change:
- skip_standard_edition_check: bool = Field(
+ known_snowflake_edition: Optional[SnowflakeEdition] = Field(
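
For illustration, a minimal sketch of the suggested shape, assuming the existing SnowflakeEdition enum and pydantic Field; the fall-through wiring is an assumption:

# Sketch of the reviewer's suggestion; exact Field arguments and the
# fall-through logic are assumptions.
known_snowflake_edition: Optional[SnowflakeEdition] = Field(
    default=None,
    description=(
        "The Snowflake edition in use. If unset, the edition is inferred "
        "by running `show tags`."
    ),
)

# The source could then skip the probe when the edition is known:
def is_standard_edition(self) -> bool:
    if self.config.known_snowflake_edition is not None:
        return self.config.known_snowflake_edition == SnowflakeEdition.STANDARD
    ...  # fall back to the existing `show tags` based check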

Collaborator

it also needs some tests to go along with it


warn_no_datasets: bool = Field(
default=True,
description="If True, warns when no datasets are found during ingestion. If False, ingestion fails when no datasets are found.",
Collaborator

This flag is not particularly well aligned with our vision for how ingestion should be set up.

I'm willing to accept it if

  1. it is marked as hidden_from_docs=True. As such, it's also not subject to the same backwards compatibility guarantees as most other ingestion configs
  2. we add a comment explaining the context around why it was added in the first place
  3. it comes with the understanding that if at some point, it becomes difficult for us to maintain this flag, we may remove it

Author

Added hidden_from_docs=True and comment for context.
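
For reference, a sketch of what the field might look like after that feedback; the exact comment and description wording are assumptions:

# Context comment (sketch): added for deployments that run one schema per
# recipe and may legitimately hit schemas with no datasets; this flag is
# hidden from docs and may be removed if it becomes hard to maintain.
warn_no_datasets: bool = Field(
    default=True,
    hidden_from_docs=True,
    description=(
        "If True, warns when no datasets are found during ingestion. "
        "If False, ingestion fails when no datasets are found."
    ),
)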

Author

I understand that this may be deprecated in the future; if it is removed, our fork may stop working when the changes are pulled in.

Since it is possible to create empty containers in the data catalog when a schema contains no tables, we are wondering why we wouldn't want to allow ingesting an empty schema as an empty container.

Collaborator

In most cases, empty schemas indicate a permissions issue. We're optimizing for a good user experience for that typical user.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Apr 15, 2025
@AndrewSmith593 AndrewSmith593 changed the title Add Snowflake ingestion config options: allow_empty_schemas, skip_standard_edition_check feat(ingestion): add snowflake ingestion config options Apr 23, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Apr 25, 2025
sf_cursor.execute.side_effect = query_permission_response_override(
default_query_results,
[SnowflakeQuery.current_region()],
[{"CURRENT_REGION()": "AWS_AP_SOUTH_1"}],
Collaborator

why is it required for us to override the region in this test?



@freeze_time(FROZEN_TIME)
def test_snowflake_is_standard_explicit_false(pytestconfig, snowflake_pipeline_config):
Collaborator

These tests look largely like they're copy-pasted from each other. In general, that's not a good sign. You might want to look into pytest test parameterization.
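
For illustration, a hedged sketch of collapsing the copy-pasted tests with pytest parameterization; the test name, fixtures, and expected values are assumptions:

# Hypothetical sketch; the body and expected values are assumptions
# based on the surrounding tests.
import pytest

@pytest.mark.parametrize(
    "skip_check,expected_standard",
    [
        (True, False),   # skipping the probe assumes a non-standard edition
        (False, True),   # running the probe detects standard edition here
    ],
)
@freeze_time(FROZEN_TIME)
def test_snowflake_standard_edition_check(
    skip_check, expected_standard, pytestconfig, snowflake_pipeline_config
):
    # Would set skip_standard_edition_check=skip_check on the config and
    # assert the source's is_standard_edition() matches expected_standard.
    ...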

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Apr 28, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels May 15, 2025