Add pseudo selectors that select models based on artifact states #2465

Closed
@drewbanin

Description

See also #2172, #2425

Describe the feature

This change would be in support of:

  1. Improved dev experiences
  2. Slimmer CI builds

If dbt is provided artifacts (manifest, run_results) produced from a previous run of dbt, then dbt will be able to determine:

  1. New nodes
  2. Changed nodes
  3. Nodes that failed to build in a previous invocation

Here are some high-level example usage scenarios:

# Run new and changed models (and their descendants) in a CI build
$ dbt --state prod-target/ run --models @state:modified

# Re-run failed models and their children in development (or, re-run a prod job that failed)
$ dbt --state target/ run --models build:error+

# Re-run failed models and their children in development
# Note: --state is implied to be target/ here
$ dbt run --models build:error+
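
For illustration, the trailing `+` in `build:error+` means "the selected nodes plus everything downstream of them". A minimal sketch of that expansion follows, assuming the DAG is available as a plain mapping from node name to its direct children. The `with_descendants` helper and the `children` mapping are hypothetical names for this example, not dbt internals.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def with_descendants(selected: Iterable[str], children: Dict[str, List[str]]) -> Set[str]:
    """Return the selected nodes plus all of their transitive children.

    `children` is a hypothetical adjacency mapping (node -> direct children);
    it stands in for whatever graph structure dbt actually uses.
    """
    seen: Set[str] = set(selected)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                # First time we reach this child: select it and keep walking.
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical example graph: stg_orders -> orders -> revenue
dag = {"stg_orders": ["orders"], "orders": ["revenue"], "stg_users": ["users"]}
failed = ["stg_orders"]
to_rerun = with_descendants(failed, dag)
```

Under these assumptions, a failure in `stg_orders` would re-run `stg_orders`, `orders`, and `revenue`, while the unrelated `stg_users` branch is left alone.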

Implementation details

dbt is going to need to point to the artifacts from a previous invocation to compare manifests or determine build statuses from a previous run. To accomplish this, we could add a flag like --state which should point to a folder containing the manifest and run_results from a previous invocation of dbt. It will be the user's responsibility to make sure these artifacts are present in their environment.

--state flag:

  • This flag probably makes the most sense as a flag to dbt, as it will apply to many subcommands (e.g. compile, run, test, seed, snapshot, and ls). It can definitely be a flag to subcommands (or both) if that makes sense.
  • The default value should be target/
  • If the expected state files are not present, dbt should still run successfully, but any selector that relies on this state information should fail if used.
    • e.g. dbt run --models build:error will fail with an appropriate error if the target/ dir does not exist
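
As a rough sketch of the "succeed unless a state selector is used" behaviour described above: loading an artifact returns nothing when the file is missing, and only the code path for a state-based selector turns that into an error. The file names (`manifest.json`, `run_results.json`), the function names, and `MissingStateError` are assumptions for this example, not a proposal for the actual implementation.

```python
import json
from pathlib import Path
from typing import Optional


class MissingStateError(Exception):
    """Hypothetical error: a state-based selector was used without artifacts."""


def load_artifact(state_dir: Path, name: str) -> Optional[dict]:
    """Return the parsed artifact, or None if it is absent (not an error by itself)."""
    path = state_dir / name
    if not path.is_file():
        return None
    with path.open() as fh:
        return json.load(fh)


def require_artifact(state_dir: Path, name: str, selector: str) -> dict:
    """Called only when a selector such as build:error actually needs the artifact."""
    artifact = load_artifact(state_dir, name)
    if artifact is None:
        raise MissingStateError(
            f"Selector '{selector}' needs {name}, but it was not found in {state_dir}"
        )
    return artifact
```

With this split, a plain `dbt run` would never call `require_artifact`, so a missing target/ directory only matters once a selector like `build:error` is resolved.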

Selectors:

  • state:modified: Will select any nodes whose hashes have changed compared to the value present in the manifest artifact
  • state:new: Will select any nodes which are present in the project but are not present in the manifest artifact
  • We'll probably want to provide some shorthand that selects new & changed files for local dev
  • build:error: Will select any nodes which errored or were skipped in the run_results state artifact
  • build:success: I don't know that there's a concrete use-case for something like this, but it seems sensible to implement selectors for different states
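
The selectors above could be sketched roughly as set operations over the artifacts. This assumes the manifest can be reduced to a mapping of node id to content hash, and that run_results exposes a list of results with an id and a status. Those shapes, the field names `unique_id` and `status`, and the status strings are assumptions for illustration, not the real artifact schema.

```python
from typing import Dict, Iterable, Set


def select_new(current: Dict[str, str], previous: Dict[str, str]) -> Set[str]:
    """state:new -> nodes in the project that the previous manifest does not know."""
    return {uid for uid in current if uid not in previous}


def select_modified(current: Dict[str, str], previous: Dict[str, str]) -> Set[str]:
    """state:modified -> nodes present in both whose hash differs."""
    return {
        uid for uid, digest in current.items()
        if uid in previous and previous[uid] != digest
    }


def select_by_status(run_results: dict, statuses: Iterable[str]) -> Set[str]:
    """build:<status> -> nodes whose previous result had one of the given statuses."""
    wanted = set(statuses)
    return {
        result["unique_id"]
        for result in run_results.get("results", [])
        if result.get("status") in wanted
    }


# Hypothetical inputs
previous = {"model.a": "h1", "model.b": "h2"}
current = {"model.a": "h1", "model.b": "h2-changed", "model.c": "h3"}
results = {"results": [
    {"unique_id": "model.a", "status": "success"},
    {"unique_id": "model.b", "status": "error"},
    {"unique_id": "model.c", "status": "skipped"},
]}

new_and_changed = select_new(current, previous) | select_modified(current, previous)
errored = select_by_status(results, ["error", "skipped"])
```

Here the shorthand for "new & changed" is just the union of the two state selectors, and `build:error` treats skipped nodes the same as errored ones, as described in the list above.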

Determining nodes that have changed

This is a tricky problem! A very simple version of this functionality can be implemented with a git diff --name-only. That will get you pretty far, but it will not account for:

  • models that should be considered changed because they reference a macro that has changed
  • schema.yml files (it's tough to correlate .yml file changes to dbt nodes, at least as far as git is concerned)
  • the global impacts of changes to specific macros (eg. generate_schema_name) or the dbt_project.yml file
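
One possible way to cover the macro case from the list above, sketched under assumptions: fold the source of every macro a model references into the model's hash, so that editing a macro changes the hash of each model that uses it. The `fingerprint` function and its inputs are hypothetical, and this sketch does not address schema.yml changes or project-wide settings such as dbt_project.yml.

```python
import hashlib
from typing import Dict, Iterable


def fingerprint(node_sql: str, macro_names: Iterable[str], macro_sources: Dict[str, str]) -> str:
    """Hash a node's SQL together with the source of the macros it references.

    `macro_names` is assumed to already contain every macro the node depends on;
    resolving that list is out of scope for this sketch.
    """
    digest = hashlib.sha256(node_sql.encode("utf-8"))
    # Sort so the result does not depend on the order macros were discovered in.
    for name in sorted(macro_names):
        digest.update(name.encode("utf-8"))
        digest.update(macro_sources[name].encode("utf-8"))
    return digest.hexdigest()


# Hypothetical example: the model file is untouched, but a macro it uses changes.
sql = "select {{ cents_to_dollars('amount') }} from payments"
before = fingerprint(sql, ["cents_to_dollars"], {"cents_to_dollars": "{{ col }} / 100"})
after = fingerprint(sql, ["cents_to_dollars"], {"cents_to_dollars": "({{ col }} / 100.0)"})
```

In this example `before` and `after` differ even though the model's own file did not change, which is exactly the case a git diff --name-only approach would miss.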

Describe alternatives you've considered

  • Git trickery: This is an incomplete solution and won't fare super well in CI envs, but might be hackable in local dev work

Who will this benefit?

  • People who run dbt jobs in their CI envs
  • People who are making iterative changes in development
  • We could add a "Rerun from failed" button in dbt Cloud, and folks running dbt in their own prod envs could do something similar (eg. in an Airflow error handler) for intermittent build failures

Labels: enhancement, node selection, state
