
Add support to infer schemas on BigQuery #249

Closed · jbergeskans opened this issue Jan 17, 2024 · 5 comments
Labels: enhancement (New feature or request), triage
Milestone: 1.0.0

Comments

@jbergeskans

Describe the feature

Allow the BigQuery plugin to infer schemas when the user has chosen to document the external table for dbt docs (descriptions, constraints, etc.) but does not wish to also add data types for every column.

Describe alternatives you've considered

  • Fully document the source, which in our case amounts to several thousand lines across a dozen external tables.
  • Document the fields in the models that use these tables as sources.

Additional context

Currently, if a user adds column descriptions to a sources file, the package generates the table creation query from the column information available in that file. If no data_type is set, creating the table fails with the error message:

15:10:27  Encountered an error while running operation: Database Error
  Type not found: None at [x:y]

This is because the generated query will look like this:

create or replace external table `my_project`.`my_dataset`.`some_table_name`(
        pk_col None,
        data_col None,
        loaded_date None)

with partition columns (
        loaded_date date)

options (
        uris = ['gs://bucket/folder/*'], format = 'parquet', hive_partition_uri_prefix = 'gs://bucket/folder/')

This happens because of this code block.

This solution is already implemented in the Snowflake plugin, and I don't see why the BigQuery fix needs to be more complicated than allowing the user to pass in infer_schema and checking its truthiness.
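For illustration, here is a minimal sketch of the DDL that a truthy infer_schema could emit, relying on BigQuery's built-in schema auto-detection for Parquet. This reuses the example bucket paths above and is only a sketch of the idea, not the package's actual output:

-- Sketch: omit the column list entirely and let BigQuery auto-detect the
-- schema from the Parquet files; omitting the list after `with partition
-- columns` likewise infers the hive partition columns from the URI layout.
create or replace external table `my_project`.`my_dataset`.`some_table_name`
with partition columns
options (
        uris = ['gs://bucket/folder/*'], format = 'parquet', hive_partition_uri_prefix = 'gs://bucket/folder/')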

Who will this benefit?

This will allow users to infer the schema while still seeing descriptions and constraints in the dbt docs.

@jbergeskans jbergeskans added enhancement New feature or request triage labels Jan 17, 2024
@jbergeskans
Author

Willing to create a PR of this, assuming the fix is as straightforward as it seems.

@lassebenni
Contributor

I second this feature!

@dataders dataders added this to the 1.0.0 milestone Apr 4, 2024
@thomas-vl
Contributor

To me this feels contradictory: you want to infer the schema automatically, yet you still want to add the column names manually for documentation.

I see two problems with this setup:

  1. The inferred schema in BigQuery might differ from the columns you entered manually, so the documentation no longer reflects reality.
  2. What happens when you apply a data test to a column that is not inferred in BigQuery because it was removed from the Parquet file?

@dataders
Collaborator

dataders commented Apr 10, 2024

@jbergeskans I agree that manually specifying column names and datatypes can be a pain!

I looked at #259, but the implementation differs from Snowflake's. In dbt-snowflake's version, if infer_schema: true, the following happens:

  1. a one-off select * from table( infer_schema( location=>'{{external.location}}', file_format=>'{{external.file_format}}') ) is executed, with the results stored as columns_infer (source: macros/plugins/snowflake/create_external_table.sql#L8-L15); see the sketch after this list
  2. the results of columns_infer are iterated over to generate the correct data types for the CREATE EXTERNAL TABLE statement (source)
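For concreteness, a hedged sketch of step 1 (the stage path and file format name here are hypothetical): Snowflake's INFER_SCHEMA returns one row per detected column, including COLUMN_NAME and TYPE, which step 2 then folds into the column list of the CREATE EXTERNAL TABLE statement:

-- one row per column detected in the staged files
select column_name, type
from table(
        infer_schema(
                location => '@my_stage/folder/',         -- hypothetical stage path
                file_format => 'my_parquet_file_format'  -- hypothetical named file format
        )
)
-- yields rows like ('PK_COL', 'NUMBER(38, 0)'), ('LOADED_DATE', 'DATE'),
-- which become: create external table ... (pk_col NUMBER(38, 0), ... )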

I agree with @thomas-vl and am conflicted about your implementation. However, I'm not in love with the precedent that is Snowflake's implementation either.

@jbergeskans, am I right in guessing that if the sources.yml column names and types were generated for you, you wouldn't need schema inference? Especially in your case, where you're happy to enumerate the column names because doing so provides value?

What you are not looking for is something that would allow for schema evolution of an external table, as in: if the actual external files undergo one of the changes below, dbt wouldn't flinch and would recreate the table accordingly:

  • a new column is added
  • an existing column's type has changed from what existed previously

My gut tells me that we DON'T want auto-schema evolution -- data changing silently should not be a feature, right?

I've also opened dbt-labs/dbt-codegen#173; that feels like the "right solution"™️ to me.

@jbergeskans
Author

@dataders, fully agreed that silently changing the schema is not something that should ever be allowed. And yes, I would be happy enough if this were generated for me so that I could easily go in and add tests, apply policy tags, and add constraints where needed.

In this case, what we had was a spreadsheet with column names and their descriptions, so I wrote a script to generate the YAML file from the spreadsheet, which is how we ran into this issue. But if I could have had the field names and data types generated for me, I'd take that 100%.
