feat(ingestion): add snowflake ingestion config options #12841


Open
wants to merge 15 commits into master

Conversation

AndrewSmith593

This PR adds the following config options. allow_empty_schemas was added to avoid a generic permissions error that was thrown when attempting to ingest schemas with no tables/views. skip_standard_edition_check was added to avoid redundant and costly calls to show_tags(), which were spending unnecessary Snowflake compute credits. These config options have been tested thoroughly in the Stage environment of the Optum DataHub fork.

allow_empty_schemas - If set to True, allows schemas with no tables or views to be processed without reporting a generic permissions error. Defaults to False.

skip_standard_edition_check - If set to True, assumes the Snowflake account is Enterprise Edition (or higher) and skips the standard-edition check. Defaults to False.
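
For illustration, a minimal sketch of a recipe using these options as originally proposed, via DataHub's Python pipeline API; the account and credential values are placeholders:

# Hypothetical recipe sketch; account/credential values are placeholders,
# and the option names reflect this PR as originally proposed.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "snowflake",
            "config": {
                "account_id": "my_account",
                "username": "datahub_user",
                "password": "********",
                # New options added by this PR:
                "allow_empty_schemas": True,
                "skip_standard_edition_check": True,
            },
        },
        "sink": {"type": "console"},
    }
)
pipeline.run()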

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Mar 11, 2025

codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅


@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Mar 11, 2025
@@ -550,10 +550,11 @@ def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
len(discovered_tables) == 0
and len(discovered_views) == 0
and len(discovered_streams) == 0
and not self.config.allow_empty_schemas
Collaborator

I'm a bit confused by this change - it's called allow_empty_schemas, but it gates the failure report we produce if no tables/views/streams are found across all databases.

We also produce a warning around empty schemas. I'd be ok with an option like warn_on_empty_schemas that is enabled by default and can be explicitly disabled.

Author

The intent of this change was to avoid the error we were seeing when ingesting empty schemas: ERROR {datahub.ingestion.source.snowflake.snowflake_utils:254} - permission-error => No tables/views found. Please check permissions.

The error implied a permissions error when in fact we had permissions, but the schema showed as empty in source Snowflake.

I could change this to a warn_on_empty_schemas to warn but not throw an error on empty schemas, which is default enabled.

Author

Went with warn_no_datasets to make it more generic
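
For context, a rough sketch of how the renamed warn_no_datasets flag might gate that report; the reporter method names here are assumptions, not the exact DataHub API:

# Rough sketch only; reporter method names are assumptions.
if (
    len(discovered_tables) == 0
    and len(discovered_views) == 0
    and len(discovered_streams) == 0
):
    if self.config.warn_no_datasets:
        # Warn but let the run continue, e.g. for intentionally empty schemas.
        self.report.warning("No tables/views/streams found; schemas may be empty.")
    else:
        # Original behavior: fail, since empty results usually indicate
        # missing permissions.
        self.report.failure(
            "permission-error",
            "No tables/views found. Please check permissions.",
        )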

@@ -732,6 +733,8 @@ def get_snowsight_url_builder(self) -> Optional[SnowsightUrlBuilder]:
return None

def is_standard_edition(self) -> bool:
if self.config.skip_standard_edition_check:
return False
Collaborator

the PR description said that the show tags() call was costly. I believe we call it exactly once per ingestion run - is even a single invocation super costly?

Also, is there a better/cheaper way to automatically determine if we're on standard/enterprise edition? I prefer to make things automated instead of requiring a config for it

Author

We have implemented our ingestion with one schema per recipe, and have 9000+ schemas per full ingestion from one of our Snowflake platform instances. The show tags() call would add up to a considerable amount of Snowflake compute credits over time with this many schemas in our process, which runs nightly.

The standard edition check is a valid way to determine the edition, but for edge cases like this, where compute cost over time is considerable, I think this config option is viable.

Collaborator

We have implemented our ingestion with one schema per recipe, and have 9000+ schemas per full ingestion from one of our Snowflake platform instances

To be clear - having 9000+ ingestion recipes is not the recommended way to set up ingestion. I would recommend setting up fewer ingestions, and having those ingest metadata from multiple schemas.

Author

Right, we are working on reducing the number of parallel ingestions. The purpose of running 1 schema per pod is observability and resilience: it prevents the entire pipeline from stopping due to an error with a single schema.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 20, 2025
@jjoyce0510
Collaborator

Once comments are addressed, let's get this in! Thanks team

@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 28, 2025
@rtekal
Contributor

rtekal commented Apr 15, 2025

@hsheth2,
We addressed all the review comments. Would you please review the PR again? Thanks in advance.


@@ -325,6 +325,16 @@ class SnowflakeV2Config(
" Map of share name -> details of share.",
)

skip_standard_edition_check: bool = Field(
Collaborator

imo we should have something more like this

if unset, we will infer the edition using the show tags command

Suggested change:
- skip_standard_edition_check: bool = Field(
+ known_snowflake_edition: Optional[SnowflakeEdition] = Field(
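
For illustration, a minimal sketch of the suggested shape, assuming the existing SnowflakeEdition enum and pydantic Field; the fall-through wiring is an assumption:

# Sketch of the reviewer's suggestion; exact Field arguments and the
# fall-through logic are assumptions.
known_snowflake_edition: Optional[SnowflakeEdition] = Field(
    default=None,
    description=(
        "The Snowflake edition in use. If unset, the edition is inferred "
        "by running `show tags`."
    ),
)

# The source could then skip the probe when the edition is known:
def is_standard_edition(self) -> bool:
    if self.config.known_snowflake_edition is not None:
        return self.config.known_snowflake_edition == SnowflakeEdition.STANDARD
    ...  # fall back to the existing `show tags` based check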

Collaborator

it also needs some tests to go along with it


warn_no_datasets: bool = Field(
default=True,
description="If True, warns when no datasets are found during ingestion. If False, ingestion fails when no datasets are found.",
Collaborator

This flag is not particularly well aligned with our vision for how ingestion should be set up.

I'm willing to accept it if

  1. it is marked as hidden_from_docs=True. As such, it's also not subject to the same backwards compatibility guarantees as most other ingestion configs
  2. we add a comment explaining the context around why it was added in the first place
  3. it comes with the understanding that if at some point, it becomes difficult for us to maintain this flag, we may remove it

Author

Added hidden_from_docs=True and comment for context.
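
For reference, a sketch of what the field might look like after that feedback; the exact comment and description wording are assumptions:

# Context comment (sketch): added for deployments that run one schema per
# recipe and may legitimately hit schemas with no datasets; this flag is
# hidden from docs and may be removed if it becomes hard to maintain.
warn_no_datasets: bool = Field(
    default=True,
    hidden_from_docs=True,
    description=(
        "If True, warns when no datasets are found during ingestion. "
        "If False, ingestion fails when no datasets are found."
    ),
)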

Author

I understand that this may be deprecated in the future; if it is removed, our fork may stop working when the changes are pulled in.

Since it is possible to create empty containers in the data catalog when a schema contains no tables, we are wondering why we wouldn't want to allow ingesting an empty schema as an empty container.

Collaborator

In most cases, empty schemas indicate a permissions issue. We're optimizing for a good user experience for that typical user.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Apr 15, 2025
@AndrewSmith593 AndrewSmith593 changed the title Add Snowflake ingestion config options: allow_empty_schemas, skip_standard_edition_check feat(ingestion): add snowflake ingestion config options Apr 23, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Apr 25, 2025
sf_cursor.execute.side_effect = query_permission_response_override(
default_query_results,
[SnowflakeQuery.current_region()],
[{"CURRENT_REGION()": "AWS_AP_SOUTH_1"}],
Collaborator

why is it required for us to override the region in this test?



@freeze_time(FROZEN_TIME)
def test_snowflake_is_standard_explicit_false(pytestconfig, snowflake_pipeline_config):
Collaborator

These tests look largely like they're copy-pasted from each other. In general, that's not a good sign. You might want to look into pytest test parameterization.
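
For illustration, a hedged sketch of collapsing the copy-pasted tests with pytest parameterization; the test name, fixtures, and expected values are assumptions:

# Hypothetical sketch; the body and expected values are assumptions
# based on the surrounding tests.
import pytest

@pytest.mark.parametrize(
    "skip_check,expected_standard",
    [
        (True, False),   # skipping the probe assumes a non-standard edition
        (False, True),   # running the probe detects standard edition here
    ],
)
@freeze_time(FROZEN_TIME)
def test_snowflake_standard_edition_check(
    skip_check, expected_standard, pytestconfig, snowflake_pipeline_config
):
    # Would set skip_standard_edition_check=skip_check on the config and
    # assert the source's is_standard_edition() matches expected_standard.
    ...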

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Apr 28, 2025
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels May 15, 2025