feat: Smart dlt.Relation.join() #2960
base: devel
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
this is pretty cool. Top level comments:
- I think we should have one `join` method on relation - I assumed that we support joins of any depth in my review. that impacts aliasing
- I assumed `sqlglot` is not too smart with generating unique aliases for columns and tables when joining - I linked some code that is generating fact tables. the goal is more or less the same as here
```python
else:
    raise TypeError

current_table_name = rel._sqlglot_expression.find(sge.From).name
```
will it work with multiple joins? i.e. I have nesting level 2 and I use join twice. hmmm, I think so. or are multi-joins out of scope?
This work was based on the incremental loading PRs where you suggested not to do recursive joins and only support immediate parent-child and root-child joins.
For this PR I believe we could support:

```python
"""
tables with `root_key=True`:
- orders
- orders__items
- orders__items__details
- customers  # with ref `orders.customer_id == customers.id`
"""
# parent-child
dataset.table("orders").join("orders__items")
# LEFT JOIN orders._dlt_id = orders__items._dlt_parent_id
dataset.table("orders__items").join("orders__items__details")
# LEFT JOIN orders__items._dlt_id = orders__items__details._dlt_parent_id

# root-descendant
dataset.table("orders").join("orders__items__details")
# LEFT JOIN orders._dlt_id = orders__items__details._dlt_root_id

# relationship
dataset.table("orders").join("customers")
# INNER JOIN orders.customer_id = customers.id
dataset.table("customers").join("orders")
# INNER JOIN customers.id = orders.customer_id
```

Currently excluded:

```python
# recursive parent-child
# recursive API
dataset.table("orders").join("orders__items").join("orders__items__details")
# using a list API is easier
dataset.table("orders").join(["orders__items", "orders__items__details"])
```

This has new implications:
- We're keeping columns from 3 tables instead of 2
- It's different from root-descendant because we are including intermediary parents which have no descendant rows.
- Providing convenience table/column aliases is much harder for the recursive API because we need to introspect the previous query that renamed columns
```python
raise TypeError

current_table_name = rel._sqlglot_expression.find(sge.From).name
current_table_schema = self._dataset.schema.tables[current_table_name]
```
you should take column names from the (qualified) relation's query, so you can identify which columns are coming from current_table_name. you should also recover aliases here to create column aliases in the final SELECT
this will let you create joins on relations that do not select all columns
```python
if references is None:
    raise KeyError(f"No references found on schema for table `{current_table_name}`")

other_table_name = other_rel._sqlglot_expression.find(sge.From).name
```
same thing
```python
current_cols_aliases = []
other_cols_aliases = []

for col in current_table_schema["columns"]:
```
here you are comparing normalized identifiers with non-normalized constants (C_DLT_LOAD_ID). here's how we identify dlt columns and tables in filesystem:

```python
def is_dlt_identifier(self, ident: str) -> bool:
    return ident.startswith(self.schema._dlt_tables_prefix)
```

maybe it makes sense to add it to the Schema object
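A standalone version of that check, assuming the default naming convention normalizes the dlt prefix to `_dlt` (the real code reads the normalized prefix from the schema):

```python
# assumed: normalized value of DLT_NAME_PREFIX under the default naming convention
DLT_PREFIX = "_dlt"

def is_dlt_identifier(ident: str) -> bool:
    # compare against the normalized prefix rather than raw constants
    return ident.startswith(DLT_PREFIX)

print(is_dlt_identifier("_dlt_load_id"))  # True
print(is_dlt_identifier("customer_id"))   # False
```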
```python
    return select_expr


def _create_column_alias(
```
yes this is good! but please use the naming convention to create "compound" identifiers, i.e. like below: this will split parts into single identifiers and then create a compound one (takes into account that parts may be already compound)

```python
def normalize_alias(c: DbtGeneratorConfig, *parts: str) -> str:
    return c.naming.normalize_path(c.naming.make_path(*parts))
```

```python
alias = _create_column_alias(parent_table_name, col)
parent_cols_aliases.append(alias)

_, _, child_table_part = child_table_name.removeprefix(parent_table_name).partition("__")
```
you should use break_path and make_path in schema.naming.
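A sketch of what `break_path`/`make_path` do, assuming the default `__` nesting separator (the real implementations live on `schema.naming` and also normalize each part):

```python
SEPARATOR = "__"  # dlt's default nesting separator (assumed here)

def break_path(path: str) -> list[str]:
    # split a compound identifier into its single-identifier parts
    return [p for p in path.split(SEPARATOR) if p]

def make_path(*parts: str) -> str:
    # join parts back into a compound identifier
    return SEPARATOR.join(parts)

# safer than removeprefix + partition on the raw string:
parts = break_path("orders__items__details")
print(parts)                  # ['orders', 'items', 'details']
print(make_path(*parts[1:]))  # 'items__details'
```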
```python
parent_cols_aliases = []
for col in parent_table_schema["columns"]:
    if col in ["_dlt_id", "_dlt_load_id"]:
```
please compare only normalized identifiers. it may make sense to cache _dlt_id and _dlt_load_id in schema along a few others that are already there:

```python
self._dlt_tables_prefix = to_naming.normalize_table_identifier(DLT_NAME_PREFIX)
self.version_table_name = to_naming.normalize_table_identifier(VERSION_TABLE_NAME)
self.loads_table_name = to_naming.normalize_table_identifier(LOADS_TABLE_NAME)
self.state_table_name = to_naming.normalize_table_identifier(PIPELINE_STATE_TABLE_NAME)
```

```python
        *other_cols_aliases,
    )
    .from_(current_table_name)
    .join(other_table_name, on=join_condition, join_type=how)
```
what happens if the other relation already has a join? i.e. users.join(orders.join(items))? is sqlglot able to generate proper aliasing itself? if not you'll need to assign aliases to the tables, i.e. orders__items. the same table may be joined several times. also you remove prefixes from nested tables; that may create conflicts.
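The prefix-stripping conflict is easy to check for before building the SELECT list. A small illustrative helper (not dlt API) that flags output names which would become ambiguous:

```python
from collections import Counter

def find_alias_collisions(aliases: list[str]) -> list[str]:
    # any alias appearing more than once would produce an ambiguous SELECT list
    return [a for a, n in Counter(aliases).items() if n > 1]

# e.g. stripping the table prefix from both orders.name and
# orders__items.name leaves two bare "name" columns
aliases = ["orders__id", "name", "name"]
print(find_alias_collisions(aliases))  # ['name']
```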
```python
child_cols_aliases.append(alias)

meta_cols_aliases = (
    _create_column_alias(parent_table_name, "_dlt_load_id", prefix="", separator=""),
```
this is done so query looks nicer?
```python
rel._sqlglot_expression = rel._sqlglot_expression.where(condition)
return rel


def join(
```
We already have a join generator which works pretty well for generating dbt packages. It supports both regular and nested references and also joins of unlimited depth (via table and column aliasing). Maybe some code or ideas could be helpful here?

here we render the fact table: https://github.com/dlt-hub/dlt-plus/blob/devel/packages/dlt_plus/dlt_plus/dbt_generator/render.py#L72

this procedure renders a fact table (indicated by config): it will create all possible joins (regular and nested), also recursively up to a certain depth. your case is simpler (just one join) but the resultant relation should allow adding more joins (both on the same level and joins of joins). or is this out of scope?

- first we collect all references, validate them, and convert nested (automated) references into regular references (get_rendered_tables)
- then required tables are added; here the crucial part is how aliasing is done. all tables except the top (fact) table will be aliased uniquely. see https://github.com/dlt-hub/dlt-plus/blob/devel/packages/dlt_plus/dlt_plus/dbt_generator/render.py#L85 and https://github.com/dlt-hub/dlt-plus/blob/devel/packages/dlt_plus/dlt_plus/dbt_generator/render.py#L113
- all columns that are added to the joined relation are uniquely aliased as well: https://github.com/dlt-hub/dlt-plus/blob/devel/packages/dlt_plus/dlt_plus/dbt_generator/render.py#L143. table aliases, not names, are used to make sure that columns in the SELECT list are unique
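The "aliased uniquely" idea can be sketched with a counter that lets the same table appear several times in one query. This is assumed logic for illustration, not the actual render.py code:

```python
def make_table_aliases(tables: list[str]) -> list[str]:
    """Assign a unique alias to each joined table occurrence."""
    seen: dict[str, int] = {}
    aliases: list[str] = []
    for t in tables:
        n = seen.get(t, 0)
        seen[t] = n + 1
        # first occurrence keeps the name, repeats get a numeric suffix
        aliases.append(t if n == 0 else f"{t}_{n}")
    return aliases

print(make_table_aliases(["orders", "orders__items", "orders__items"]))
# → ['orders', 'orders__items', 'orders__items_1']
```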
This adds convenience methods `dlt.Relation(...).join()` and `dlt.Relation(...).join_child(...)` to simplify joins between tables that have a `TTableReference` on `dlt.Schema`.

Design decisions
- `LEFT` joins for parent-child and root-descendant; `INNER` for relationship references, but could be any type.
- `.join()` can receive:
  - a `dlt.Relation` to join on (if the column is unambiguous)
  - `"table_name"` (if the column is unambiguous)
  - `"table_name.column_name"` in case of multiple references between tables

Open questions
- should we keep `.join()` and `.join_parent_child()` separately?
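The disambiguation rule above could resolve like this. A hypothetical sketch: references are modeled as `(table, column, referenced_table, referenced_column)` tuples, and the optional `column` in `"table_name.column_name"` is assumed to name the local column that carries the reference.

```python
def resolve_reference(refs, other: str):
    """Resolve a `.join()` string argument: "table" or "table.column"."""
    table, _, column = other.partition(".")
    matches = [r for r in refs if r[2] == table and (not column or r[1] == column)]
    if not matches:
        raise KeyError(f"no reference to `{table}`")
    if len(matches) > 1:
        raise ValueError(f"ambiguous references to `{table}`, use 'table.column'")
    return matches[0]

refs = [
    ("orders", "customer_id", "customers", "id"),
    ("orders", "billed_to_id", "customers", "id"),
]
# two references exist, so the column form is required
print(resolve_reference(refs, "customers.customer_id"))
# → ('orders', 'customer_id', 'customers', 'id')
```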