Skip to content

[CT-3194] [Feature] add another way to calculate source freshness - via a SQL query #8797

Closed
@graciegoheen

Description

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I think for external tables the information_schema.tables.last_altered column is the time of the last refresh (or last ALTER), not the last time a file came in.

So for example if an external table has stale data, an ALTER TABLE ... REFRESH is performed but no new files are added (because the pipeline that is dropping fiels into the S3 bucket is broken for example), I think the last_altered column will have the timestamp of when the ALTER TABLE ... REFRESH that was performed. If this will cover external tables and my understanding of the metadata is correct, then this might lead to the dbt metadata reflecting that the external table is fresher than it actually is thus obfuscating an issue (a pipeline being broken)

The directory (if enabled) of the external stage would have the list of files and when they were added to the stage. So pretty much just wondering if this feature will be a behind the scenes query of the information_schema, or if freshness source can be a user defined query like originally outlined here #7012

Originally posted by @sp-tkerlavage in dbt-labs/dbt-snowflake#785 (comment)

We currently support 2 ways to generate source freshness (via warehouse metadata tables & via a loaded_at_field). We should support a 3rd way to generate source freshness - via a loaded_at_query.

Option 1: freshness config added (get freshness from metadata warehouse tables)

sources:
  - name: my_source
    freshness:
      warn_after:
        count: 1
        period: day
      error_after:
        count: 5
        period: day

Option 2: freshness config added with loaded_at_field (get freshness from select max(loaded_at_field) … from this …)

sources:
  - name: my_source
    freshness:
      warn_after:
        count: 1
        period: day
      error_after:
        count: 5
        period: day
    loaded_at_field:

Option 3: freshness config added with "how does dbt consider a source to be fresh"? (get equivalent of "max(loaded_at_field)" from executing a custom query)

sources:
  - name: my_source
    freshness:
      warn_after:
        count: 1
        period: day
      error_after:
        count: 5
        period: day
      loaded_at_query: my_freshness_calc # config name TBD
{% freshness my_freshness_calc(source, table) %}

some sql returns a structure similar to our current query

{% endfreshness %}

The built-in one is collect_freshness: https://github.com/dbt-labs/dbt-adapters/blob/6c41bedf27063eda64375845db6ce5f7535ef6aa/dbt/include/global_project/macros/adapters/freshness.sql#L4-L16

Describe alternatives you've considered

No response

Who will this benefit?

Folks using external tables as sources

Are you interested in contributing this feature?

No response

Anything else?

No response

Metadata

Assignees

No one assigned

    Labels

    Impact: OrchenhancementNew feature or requestuser docs[docs.getdbt.com] Needs better documentation

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions