feat: implement join compatibility check for destinations #3758

@rudolfix

Description

Background

We need to determine whether two dlt.Dataset instances can be joined in a single SQL query. The current is_same_physical_destination (dlt/dataset/dataset.py:426) compares str(config), which calls str(credentials) and renders the full connection URL, including username, port, and database:

```python
def is_same_physical_destination(dataset1, dataset2):
    return str(dataset1.destination_client.config) == str(dataset2.destination_client.config)
```

This check is too strict for destinations that support cross-database joins (Snowflake, BigQuery, MSSQL, ClickHouse, Databricks, Dremio, Athena): it returns False even when tables can be joined across databases on the same server or account.

It is also semantically wrong for the filesystem destination, where joinability comes from the in-process DuckDB query engine, not from data location.

A fingerprint() method exists on DestinationClientConfiguration that hashes a location identifier (e.g. host), but it is not used by is_same_physical_destination and is missing on several destinations (DuckDB, Athena, SQLAlchemy, Fabric).

What to implement

1. Add location() to DestinationClientConfiguration

location() returns the unhashed physical location string: the identity of where this destination lives. It replaces the per-destination fingerprint() overrides; fingerprint() becomes a derived method on the base class and is no longer overridden per destination:

```python
def location(self) -> str:
    """Returns the physical location of this destination (e.g. host, bucket URL, project id)"""
    return ""

def fingerprint(self) -> str:
    """Returns a hash of location()"""
    loc = self.location()
    return digest128(loc) if loc else ""
```
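To make the derivation concrete, here is a minimal, self-contained sketch of how a destination config would plug into this scheme. `digest128` is approximated with hashlib, and `SnowflakeClientConfiguration` is a hypothetical illustration, not dlt's actual class:

```python
import hashlib
from dataclasses import dataclass


def digest128(value: str) -> str:
    # stand-in for dlt's digest128: a short, stable hash of the location
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest()[:32]


@dataclass
class DestinationClientConfiguration:
    def location(self) -> str:
        """Physical location of this destination (e.g. host, bucket URL, project id)."""
        return ""

    def fingerprint(self) -> str:
        """Derived on the base class: a hash of location(), empty if no location."""
        loc = self.location()
        return digest128(loc) if loc else ""


@dataclass
class SnowflakeClientConfiguration(DestinationClientConfiguration):
    host: str = ""  # Snowflake account identifier

    def location(self) -> str:
        # subclasses only provide the location; fingerprint() stays on the base
        return self.host


cfg = SnowflakeClientConfiguration(host="myorg-account1")
print(cfg.fingerprint())  # hash of "myorg-account1"
print(repr(DestinationClientConfiguration().fingerprint()))  # '' (no location)
```

The point of the split: subclasses override only location(), and hashing stays in one place.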

2. Add can_join_with(other) to DestinationClientConfiguration

Determines if tables from this destination can be joined with tables from other in a single query. This reflects the dlt destination's current capability, not the theoretical capability of the underlying database engine (e.g. DuckDB supports ATTACH for cross-database joins, but dlt doesn't implement it yet).

The default implementation compares locations; destinations override it for special semantics (e.g. filesystem, Postgres).

```python
def can_join_with(self, other: "DestinationClientConfiguration") -> bool:
    """Check if tables from this destination can be joined with tables from other in a single query"""
    self_loc = self.location()
    other_loc = other.location()
    if not self_loc or not other_loc:
        return False
    return self_loc == other_loc
```
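As an illustration of a per-destination override, a Postgres-style configuration could tighten the default by also requiring the same database, since Postgres cannot join across databases on one server. This is a self-contained sketch; class and field names are assumptions, not dlt's actual code:

```python
class DestinationClientConfiguration:
    def location(self) -> str:
        return ""

    def can_join_with(self, other: "DestinationClientConfiguration") -> bool:
        self_loc, other_loc = self.location(), other.location()
        if not self_loc or not other_loc:
            return False
        return self_loc == other_loc


class PostgresClientConfiguration(DestinationClientConfiguration):
    def __init__(self, host: str, port: int, database: str) -> None:
        self.host, self.port, self.database = host, port, database

    def location(self) -> str:
        return f"{self.host}:{self.port}"

    def can_join_with(self, other: "DestinationClientConfiguration") -> bool:
        # no cross-DB joins in Postgres: same server is not enough,
        # the database must match too
        return (
            isinstance(other, PostgresClientConfiguration)
            and super().can_join_with(other)
            and self.database == other.database
        )


a = PostgresClientConfiguration("db.example.com", 5432, "analytics")
b = PostgresClientConfiguration("db.example.com", 5432, "staging")
c = PostgresClientConfiguration("db.example.com", 5432, "analytics")
print(a.can_join_with(b))  # False: same server, different database
print(a.can_join_with(c))  # True
```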

3. Update is_same_physical_destination in dlt/dataset/dataset.py

Delegate to can_join_with on the config.
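A plausible shape for the updated function, with lightweight stand-ins for the dataset and client objects (the real dlt.Dataset internals are assumed, not shown):

```python
from types import SimpleNamespace


class Config:
    """Stand-in for DestinationClientConfiguration with the new method."""

    def __init__(self, loc: str) -> None:
        self._loc = loc

    def location(self) -> str:
        return self._loc

    def can_join_with(self, other: "Config") -> bool:
        return bool(self.location()) and self.location() == other.location()


def is_same_physical_destination(dataset1, dataset2) -> bool:
    # delegate joinability to the destination configuration instead of
    # comparing stringified configs (which leak credentials into the check)
    return dataset1.destination_client.config.can_join_with(
        dataset2.destination_client.config
    )


def make_dataset(loc: str):
    return SimpleNamespace(destination_client=SimpleNamespace(config=Config(loc)))


print(is_same_physical_destination(make_dataset("acct1"), make_dataset("acct1")))  # True
print(is_same_physical_destination(make_dataset("acct1"), make_dataset("acct2")))  # False
```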

Join compatibility per destination

location() is the physical identity of the destination. can_join_with determines joinability; it defaults to comparing locations but can be overridden per destination.

| Destination | location() | can_join_with rule | Notes |
| --- | --- | --- | --- |
| Postgres | host:port | same host:port/database | No cross-DB joins, override needed |
| Redshift | host:port | same host:port/database | No cross-DB joins, override needed |
| Snowflake | host (account identifier) | same location (default) | Cross-DB within account |
| BigQuery | project_id | same location (default) | Cross-dataset within project |
| MSSQL | host | same location (default) | Cross-DB via db.schema.table |
| Synapse | host (inherited from MSSQL) | same location (default) | Cross-DB via db.schema.table |
| ClickHouse | host | same location (default) | Cross-DB via db.table |
| Databricks | server_hostname | same location (default) | Cross-catalog via Unity Catalog |
| Athena | aws_data_catalog | same location (default) | Cross-DB within Glue catalog |
| Dremio | host | same location (default) | Federation engine, cross-source |
| DuckDB | database file path or ":memory:" | same location (default) | No ATTACH support in dlt currently |
| MotherDuck | access token (account) | same location (default) | Cross-DB within account |
| Filesystem | scheme://netloc or "" for local | always True vs other filesystem, override needed | In-process DuckDB scans files directly |
| DuckLake | catalog connection string (no credentials) + ducklake_name | same location (default) | Catalog is the identity, not storage (paths are relative, bucket is swappable) |
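The filesystem row is the one case where location comparison is bypassed entirely: joinability comes from the in-process DuckDB engine, so any two filesystem datasets qualify. A hedged sketch of that override (class names are assumptions):

```python
class DestinationClientConfiguration:
    def location(self) -> str:
        return ""

    def can_join_with(self, other: "DestinationClientConfiguration") -> bool:
        return bool(self.location()) and self.location() == other.location()


class FilesystemConfiguration(DestinationClientConfiguration):
    def __init__(self, bucket_url: str = "") -> None:
        self.bucket_url = bucket_url

    def location(self) -> str:
        # scheme://netloc, or "" for the local filesystem
        return self.bucket_url

    def can_join_with(self, other: "DestinationClientConfiguration") -> bool:
        # the in-process DuckDB engine scans files directly, so any two
        # filesystem datasets are joinable regardless of where the files live
        return isinstance(other, FilesystemConfiguration)


s3 = FilesystemConfiguration("s3://bucket-a")
gcs = FilesystemConfiguration("gs://bucket-b")
print(s3.can_join_with(gcs))  # True: both filesystem, location irrelevant
print(s3.can_join_with(DestinationClientConfiguration()))  # False
```

Note that the override also makes the local filesystem (empty location) joinable with remote buckets, which the default location comparison would reject.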

SQLAlchemy — dialect-dependent can_join_with

SQLAlchemy's location() is host:port (or file path for sqlite). can_join_with depends on the backend dialect:

| Backend | can_join_with rule |
| --- | --- |
| postgresql | same host:port/database (no cross-DB) |
| mysql | same host:port (cross-DB supported) |
| mssql | same host:port (cross-DB supported) |
| oracle | same host:port (cross-DB supported) |
| db2 | same host:port (cross-DB supported) |
| sqlite | same database file path |
| unknown | same host:port/database (conservative default) |
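The dialect table reduces to a small lookup: dialects with cross-database support compare host:port only, everything else also compares the database. A self-contained sketch under those assumptions (the dialect set comes from the table above; the function name is illustrative):

```python
# dialects where a single connection can join across databases
CROSS_DB_DIALECTS = {"mysql", "mssql", "oracle", "db2"}


def sqlalchemy_can_join(
    dialect: str, host_port: str, database: str,
    other_dialect: str, other_host_port: str, other_database: str,
) -> bool:
    if dialect != other_dialect or host_port != other_host_port:
        return False
    if dialect == "sqlite":
        # location is the database file path, carried in host_port here;
        # same file means same database
        return True
    if dialect in CROSS_DB_DIALECTS:
        return True  # same server is enough
    # postgresql and unknown dialects: conservative, database must match too
    return database == other_database


print(sqlalchemy_can_join("mysql", "db:3306", "a", "mysql", "db:3306", "b"))  # True
print(sqlalchemy_can_join("postgresql", "db:5432", "a", "postgresql", "db:5432", "b"))  # False
```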

Missing location() implementations

These destinations currently have no fingerprint() override and return "":

| Destination | Needed location() |
| --- | --- |
| DuckDB | database file path or ":memory:" |
| Athena | aws_data_catalog identifier |
| SQLAlchemy | host:port (or file path for sqlite) |
| Fabric | host/workspace identifier from credentials |
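For DuckDB, the missing location() could normalize the database path so different spellings of the same file compare equal, while ":memory:" stays literal. A sketch under those assumptions (class name is a stand-in):

```python
import os


class DuckDbClientConfiguration:
    def __init__(self, database: str) -> None:
        self.database = database  # file path or ":memory:"

    def location(self) -> str:
        if self.database == ":memory:":
            # in-memory databases have no physical path
            return ":memory:"
        # absolute path so "./db.duckdb" and "db.duckdb" compare equal
        return os.path.abspath(self.database)


print(DuckDbClientConfiguration(":memory:").location())  # :memory:
print(
    DuckDbClientConfiguration("./data.duckdb").location()
    == DuckDbClientConfiguration("data.duckdb").location()
)  # True
```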

Existing corrections needed

| Destination | Current fingerprint() | Problem | Fix |
| --- | --- | --- | --- |
| DuckLake | digest128(storage bucket) | Storage is swappable via OVERRIDE_DATA_PATH; catalog is the real identity | location() = catalog connection string (no credentials) + ducklake_name |
