[CT-3194] [Feature] add another way to calculate source freshness - via a SQL query #8797
Description
Is this your first time submitting a feature request?
- I have read the expectations for open source contributors
- I have searched the existing issues, and I could not find an existing issue for this feature
- I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion
Describe the feature
I think for external tables the
information_schema.tables.last_altered column
is the time of the last refresh (or last ALTER), not the last time a file came in.So for example if an external table has stale data, an
ALTER TABLE ... REFRESH
is performed but no new files are added (because the pipeline that is dropping fiels into the S3 bucket is broken for example), I think thelast_altered
column will have the timestamp of when theALTER TABLE ... REFRESH
that was performed. If this will cover external tables and my understanding of the metadata is correct, then this might lead to the dbt metadata reflecting that the external table is fresher than it actually is thus obfuscating an issue (a pipeline being broken)The
directory
(if enabled) of the external stage would have the list of files and when they were added to the stage. So pretty much just wondering if this feature will be a behind the scenes query of the information_schema, or if freshness source can be a user defined query like originally outlined here #7012Originally posted by @sp-tkerlavage in dbt-labs/dbt-snowflake#785 (comment)
We currently support 2 ways to generate source freshness (via warehouse metadata tables & via a loaded_at_field
). We should support a 3rd way to generate source freshness - via a loaded_at_query
.
Option 1: freshness config added (get freshness from metadata warehouse tables)
sources:
- name: my_source
freshness:
warn_after:
count: 1
period: day
error_after:
count: 5
period: day
Option 2: freshness config added with loaded_at_field (get freshness from select max(loaded_at_field) … from this …)
sources:
- name: my_source
freshness:
warn_after:
count: 1
period: day
error_after:
count: 5
period: day
loaded_at_field:
Option 3: freshness config added with "how does dbt consider a source to be fresh"? (get equivalent of "max(loaded_at_field)
" from executing a custom query)
sources:
- name: my_source
freshness:
warn_after:
count: 1
period: day
error_after:
count: 5
period: day
loaded_at_query: my_freshness_calc # config name TBD
{% freshness my_freshness_calc(source, table) %}
some sql returns a structure similar to our current query
{% endfreshness %}
The built-in one is collect_freshness
: https://github.com/dbt-labs/dbt-adapters/blob/6c41bedf27063eda64375845db6ce5f7535ef6aa/dbt/include/global_project/macros/adapters/freshness.sql#L4-L16
Describe alternatives you've considered
No response
Who will this benefit?
Folks using external tables as sources
Are you interested in contributing this feature?
No response
Anything else?
No response