Skip to content

Comments

Feat!: Create physical tables as part of evaluation#5189

Merged
izeigerman merged 22 commits intomainfrom
feat-couple-create-and-evaluate-stages
Aug 20, 2025
Merged

Feat!: Create physical tables as part of evaluation#5189
izeigerman merged 22 commits intomainfrom
feat-couple-create-and-evaluate-stages

Conversation

@izeigerman
Copy link
Member

@izeigerman izeigerman commented Aug 19, 2025

This update eliminates the need for a separate "creating physical tables" step in the plan application. The tables are now created as part of the model evaluation DAG.

This has the following benefits:

  • Fewer queries submitted to the target engine, since the table creation is coupled with model evaluation in a single CREATE OR REPLACE call
  • Better alignment with the execution model of dbt projects which assumes that for each model in the DAG all upstream model both created tables and inserted data into them
  • Only the necessary tables are created. Previously, SQLMesh would preemptively create both the dev and non-dev tables for certain model snapshots. These tables are now created lazily, as needed

Note, that the "creating physical tables" step will still show up if the plan was applied with either the --empty-backfill or the ---skip-backfill flag, since there are no evaluations to perform.

Breaking changes:

  • Macros relying on @IF(@runtime_stage = 'creating', ...) will run every time a table is created for the first time even if data is inserted as part of the creation process (i.e. CREATE OR REPLACE TABLE...). Similarly, macros that rely on the evaluating runtime stage will no longer run as part of the first evaluation that also creates the table for the first time. Previously, there was a clear distinction between empty table creation (creating stage) and data insertion (evaluating stage). Since the two are now coupled, we had to relax the definition of the creating runtime stage to preserve some level of backwards compatibility.

@izeigerman izeigerman requested a review from a team August 19, 2025 20:09

@dataclass(frozen=True)
class EvaluateNode(BaseNode):
snapshot_name: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already defined in BaseNode. Same applies for CreateNode and Dummy Node.

Copy link
Collaborator

@erindru erindru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I guess we will find out soon how many projects were relying on create first -> then evaluate

**kwargs,
)
if self_referencing:
assert target_columns_to_types is not None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an explicit check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just sharing this thread for some more context.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have an explicit check for this on line 455. This is our 2nd entry into the self_referencing check so at this point we indeed assert that target_columns_to_types is not None.

for schema_name, catalog in unique_schemas:
table_exprs = [(gateway, exp.to_table(t)) for gateway, t in gateway_table_pairs]
unique_schemas = {
(gateway, t.args["db"], t.args.get("catalog"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: aren't these available as t.db and t.catalog?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are, but the property accessors return strings, whereas looking them up through args results in Expression instances. Given we used to do this:

        unique_schemas = {(t.args["db"], t.args.get("catalog")) for t in table_exprs if t and t.db}

I'm assuming the Expression instances are necessary to preserve identifier quotes, etc.

snapshot.name,
(intervals[batch_idx_to_wait_for], batch_idx_to_wait_for),
)
EvaluateNode(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making these implicit data structures explicit really helps when trying to understand what is going on



class SchedulingUnit(abc.ABC):
snapshot_name: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make this a dataclass then you don't have to redefine snapshot_name in the classes that inherit from this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when we talked you said you wanted this to be more of an interface but I don't really get the advantage we get from this framing. Either way works though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that a dataclass is a product type and we should be able to instantiate it. This directly conflicts with ABC, which is supposed to provide an interface and can't be instantiated. That's why I'm against mixing the two.

@izeigerman izeigerman merged commit 014fe6a into main Aug 20, 2025
27 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants